Abstract. We consider a model in which a trader aims to maximize expected risk-adjusted profit while
trading a single security. In our model, each price change is a linear combination of observed
factors, impact resulting from the trader's current and prior activity, and unpredictable random
effects. The trader must learn coefficients of a price impact model while trading. We propose
a new method for simultaneous execution and learning - the confidence-triggered regularized
adaptive certainty equivalent (CTRACE) policy - and establish a poly-logarithmic finite-time
expected regret bound. This bound implies that CTRACE is efficient in the sense that the
$(\epsilon, \delta)$-convergence time is bounded by a polynomial function of $1/\epsilon$ and $\log(1/\delta)$ with high
probability. In addition, we demonstrate via Monte Carlo simulation that CTRACE outperforms
the certainty equivalent policy and a recently proposed reinforcement learning algorithm that
is designed to explore efficiently in linear-quadratic control problems.
Key words: adaptive execution, price impact, reinforcement learning, regret bound 

1 Introduction 

A large block trade tends to "move the market" considerably during its execution by either dis- 
turbing the balance between supply and demand or adjusting other market participants' valu- 
ations. Such a trade is typically executed through a sequence of orders, each of which pushes 
price in an adverse direction. This effect is called price impact. Because it is responsible for a 
large fraction of transaction costs, it is important to design execution strategies that effectively
manage price impact. In light of this, academics and practitioners have devoted significant attention
to the topic.

The learning of a price impact model poses a challenging problem. Price impact represents an
aggregation of numerous market participants' interpretations of and reactions to executed trades. 
As such, learning requires "excitation" of the market, which can be induced by regular trading 
activity or trades deliberately designed to facilitate learning. The trader must balance the short 
term costs of accelerated learning against the long term benefits of an accurate model. Further, 
given the continual evolution of trading venues and population of market participants, price impact 
models require retuning over time. In this paper, we develop an algorithm that learns a price 
impact model while guiding trading decisions using the model being learned. 

Our problem can be viewed as a special case of reinforcement learning. This topic more broadly 
addresses sequential decision problems in which unknown properties of an environment must be
learned in the course of operation (see, e.g., Sutton and Barto). Research in this area has
established how judicious investments in decisions that explore the environment at the expense of
suboptimal short-term behavior can greatly improve longer-term performance. What we develop 
in this paper can be viewed as a reinforcement learning algorithm; the workings of price impact are 
unknown, and exploration facilitates learning. 

In reinforcement learning, one seeks to optimize the balance between exploration and exploita- 
tion - the use of what has already been learned to maximize rewards without regard to further 
learning. Certainty equivalent control (CE) represents one extreme where at any time, current 
point estimates are assumed to be correct and actions are made accordingly. This is an instance 
of pure exploitation; though learning does progress with observations made as the system evolves, 
decisions are not deliberately oriented to enhance learning. 

An important question is how aggressively a trader should explore to learn a price impact model. 
Unlike many other reinforcement learning problems, in ours a considerable degree of exploration 
is naturally induced by exploitative decisions. This is because a trader excites the market through 
regular trading activity regardless of whether or not she aims to learn a price impact model. 
This activity could, for example, be triggered by return-predictive factors, and given sufficiently 
large factor variability, the induced exploration might adequately resolve uncertainties about price
impact. Results of this paper demonstrate that executing trades to explore beyond what would
naturally occur through exploitation can yield significant benefit. 

Our work is constructive: we propose the confidence-triggered regularized adaptive certainty 
equivalent policy (CTRACE), pronounced "see-trace," a new method that explores and learns a
price impact model alongside trading. CTRACE can be viewed as a generalization of CE, which at 
each point in time estimates coefficients of a price impact model via least-squares regression using 
available data and makes decisions that optimize trading under an assumption that the estimated 
model is correct and will be used to guide all future decisions. CTRACE deviates in two ways: (1) 
$\ell_2$ regularization is applied in least-squares regression and (2) coefficients are only updated when a
certain measure of confidence exceeds a pre-specified threshold and a minimum inter-update time 
has elapsed. Note that CTRACE reduces to CE as the regularization penalty, the threshold, and 
the minimum inter-update time vanish. 

We demonstrate through Monte Carlo simulation that CTRACE outperforms CE. Further, we 
establish a finite-time regret bound for CTRACE; no such bound is available for CE. Regret is 
defined here to be the difference between realized risk-adjusted profit of a policy in question and 
one that is optimal with respect to the true price impact model. Our bound exhibits a poly- 
logarithmic dependence on time. Among other things, this regret bound implies that CTRACE is 
efficient in the sense that the $(\epsilon, \delta)$-convergence time is bounded by a polynomial function of $1/\epsilon$
and $\log(1/\delta)$ with high probability. We define the $(\epsilon, \delta)$-convergence time to be the first time when
an estimate and all the future estimates following it are within an $\epsilon$-neighborhood of a true value
with probability at least $1 - \delta$. Let us provide here some intuition for why CTRACE outperforms
CE. First, regularization enhances exploration in a critical manner. Without regularization, we 
are more likely to obtain overestimates of price impact. Such an outcome abates trading and 
thus exploration, making it difficult to escape from the predicament. Regularization reduces the 
chances of obtaining overestimates, and further, tends to yield underestimates that encourage 
active exploration. Second, requiring a high degree of confidence reduces the chances of occasionally 
producing erratic estimates, which regularly arise with application of CE. Such estimates can result 
in undesirable trades and/or reductions in the degree of exploration. 



It is also worth comparing CTRACE to a reinforcement learning algorithm recently proposed in
Abbasi-Yadkori and Szepesvari [2010], which appears well-suited for our problem. This algorithm
was designed to explore efficiently in a broader class of linear-quadratic control problems, and is
based on the principle of optimism in the face of uncertainty. Abbasi-Yadkori and Szepesvari [2010]
establish an $O(\sqrt{T \log(T/\delta)})$ regret bound that holds with probability at least $1 - \delta$, where $T$ denotes
time and some logarithmic terms are hidden. Our bound for CTRACE is on expected regret and
exhibits a dependence on $T$ of $O(\log^2 T)$. We also demonstrate via Monte Carlo simulation that
CTRACE dramatically outperforms this algorithm.

To summarize, the primary contributions of this paper include: 

(a) We propose a new method for simultaneous execution and learning - the confidence-triggered 
regularized adaptive certainty equivalent (CTRACE) policy. 

(b) We establish a finite-time expected regret bound for CTRACE that exhibits a poly-logarithmic
dependence on time. This bound implies that CTRACE is efficient in the sense that, with
probability $1 - \delta$, the $(\epsilon, \delta)$-convergence time is bounded by a polynomial function of $1/\epsilon$ and
$\log(1/\delta)$.

(c) We demonstrate via Monte Carlo simulation that CTRACE outperforms the certainty equivalent
policy and a reinforcement learning algorithm recently proposed by Abbasi-Yadkori and Szepesvari
[2010], which is designed to explore efficiently in linear-quadratic control problems.



The organization of the rest of this paper is as follows: Section 2 presents our problem
formulation, establishes existence and uniqueness of an optimal solution to our problem, and defines
performance measures that can be used to evaluate policies. In Section 3, we propose CTRACE and
derive a finite-time expected regret bound for CTRACE along with two properties: inter-temporal
consistency and efficiency. Section 4 is devoted to Monte Carlo simulation in which the performance
of CTRACE is compared to that of two benchmark policies. Finally, we conclude this paper in
Section 5. All proofs are provided in Appendix. Detailed proofs are available upon request.



2 Problem Formulation 

2.1 Model Description 

Decision Variable and Security Position We consider a trader who trades a single security over
an infinite time horizon. She submits a market buy or sell order at the beginning of each period
of equal length. $u_t \in \mathbb{R}$ represents the number of shares of the security to buy or sell at period $t$,
and a positive (negative) value of $u_t$ denotes a buy (sell) order. Let $x_{t-1} \in \mathbb{R}$ denote the trader's
pre-trade security position before placing an order $u_t$ at period $t$. Therefore, $x_t = x_{t-1} + u_t$, $t \ge 1$.
Price Dynamics The absolute return of the security is given by 

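$$\Delta p_t = p_t - p_{t-1} = g^\top f_{t-1} + \lambda^* u_t + \sum_{m=1}^{M} \gamma_m^* (d_{m,t} - d_{m,t-1}) + \epsilon_t,$$
$$d_{m,t} = \sum_{i=1}^{t} r_m^{t-i} u_i = r_m d_{m,t-1} + u_t, \qquad d_t = [d_{1,t} \cdots d_{M,t}]^\top. \tag{1}$$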



We will explain each term in detail as we progress. This can be viewed as a first-order Taylor 
expansion of a geometric model

$$\log \frac{p_t}{p_{t-1}} = g^\top f_{t-1} + \lambda^* u_t + \sum_{m=1}^{M} \gamma_m^* (d_{m,t} - d_{m,t-1}) + \epsilon_t$$



over a certain period of time, say, a few weeks in calendar time, which makes this approximation 
reasonably accurate for practical purposes. Although it is unrealistic that the security price can 
be negative with positive probability, our model nevertheless serves its practical purpose for the 
following reasons: Our numerical experiments conducted in Section 4 show that price changes more than
a few weeks ahead have negligible impact on a current optimal action. In other words, optimal
actions for our infinite-horizon control problem appear to be quite close to those for a finite-horizon
counterpart on a few-week time scale. Furthermore, it turns out that in simulation we could learn
an unknown price impact model fast enough to take actions that are close to optimal actions within
a few weeks. Thus, learning based on our price dynamics model could also be justified. We will
give concrete numerical examples later to support these notions. 

Price Impact The term $\lambda^* u_t$ represents "permanent price impact" on the security price of a
current trade. The permanent price impact is endogenously derived in Kyle [1985] from
informational asymmetry between an informed trader and uninformed competitive market makers, and in
Rosu [2009] from equilibrium of a limit order market where fully strategic liquidity traders
dynamically choose limit and market orders. Huberman and Stanzl [2004] prove that the linearity of a
time-independent permanent price impact function is a necessary and sufficient condition for the
absence of "price manipulation" and "quasi-arbitrage" under some regularity conditions.

The term $\sum_{m=1}^{M} \gamma_m^* d_{m,t}$ indicates "transient price impact" that models other traders' responses
to non-informative orders. For example, suppose that a large market buy order has arrived and
other traders monitoring the market somehow realize that there is no definitive evidence for abrupt
change in the fundamental value of the security. Then, they naturally infer that the large buy order
came merely for some liquidity reason, and gradually "correct" the perturbed price into what they
believe it is supposed to be by submitting counteracting selling orders. The dynamics of $d_{m,t}$ in (1)
indicates that the impact of a current trade on the security price decays exponentially over time,
which is considered in Obizhaeva and Wang [2005], who incorporate the dynamics of supply and
demand in a limit order market into optimal execution strategies. In Gatheral [2010], it is shown that
the exponentially decaying transient price impact is compatible only with a linear instantaneous
price impact function in the absence of "dynamic arbitrage."
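
The price dynamics in (1) can be traced numerically. The following is a minimal sketch, assuming the initial condition $d_0 = 0$ and plain Python lists; the function names `simulate_price_changes` and `transient_state_explicit` are hypothetical and only serve this illustration. It checks that the recursive form of $d_{m,t}$ agrees with the explicit exponentially weighted sum.

```python
def simulate_price_changes(u, f_prev, g, lam, gamma, r, eps):
    """Price changes from equation (1), assuming d_0 = 0.

    Index k = 0..T-1 stands for period t = k + 1, so u[k] is u_t,
    f_prev[k] is the factor vector f_{t-1} and eps[k] is eps_t.
    Returns the list of price changes and the list of states d_t.
    """
    M = len(r)
    d_prev = [0.0] * M
    dp, d_hist = [], []
    for k in range(len(u)):
        # d_{m,t} = r_m d_{m,t-1} + u_t
        d = [r[m] * d_prev[m] + u[k] for m in range(M)]
        alpha = sum(gi * fi for gi, fi in zip(g, f_prev[k]))        # g^T f_{t-1}
        transient = sum(gamma[m] * (d[m] - d_prev[m]) for m in range(M))
        dp.append(alpha + lam * u[k] + transient + eps[k])
        d_hist.append(d)
        d_prev = d
    return dp, d_hist


def transient_state_explicit(u, r_m, t):
    """d_{m,t} written out as sum_{i=1}^{t} r_m^(t-i) u_i."""
    return sum(r_m ** (t - i) * u[i - 1] for i in range(1, t + 1))
```

With a single unit trade followed by no further trading, `d_hist` shows the state shrinking by the factor $r_m$ each period, which is the exponential decay described above.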

Observable Return-Predictive Factors We assume that there are multiple observable
return-predictive factors that affect the absolute return of the security as in Garleanu and Pedersen [2009].
Those factors could be macroeconomic factors such as gross domestic products (GDP), inflation
rates and unemployment rates, security-specific factors such as P/B ratio, P/E ratio and lagged
returns, or prices of other securities that are correlated with the security price. In our price
dynamics model, $f_t \in \mathbb{R}^K$ denotes these factors and $g \in \mathbb{R}^K$ denotes factor loadings. The term
$g^\top f_{t-1}$ represents predictable excess return or "alpha." We assume that $f_t$ is a first-order vector
autoregressive process $f_t = \Phi f_{t-1} + \omega_t$ where $\Phi \in \mathbb{R}^{K \times K}$ is a stable matrix that has all eigenvalues
inside a unit disk and $\omega_t \in \mathbb{R}^K$ is a martingale difference sequence adapted to the filtration
$\{\mathcal{F}_t = \sigma(\{x_0, d_0, f_0, \omega_1, \ldots, \omega_t, \epsilon_1, \ldots, \epsilon_t\})\}$. We further assume that $\omega_t$ is bounded almost surely, i.e.
$\|\omega_t\| \le C_\omega$ for all $t \ge 1$ for some deterministic constant $C_\omega$, and $\mathrm{Cov}[\omega_t \mid \mathcal{F}_{t-1}] = \Sigma_\omega \in \mathbb{R}^{K \times K}$
being positive definite and independent of $t$.

Unpredictable Noise The term $\epsilon_t$ represents random fluctuations that cannot be accounted
for by price impact and observable return-predictive factors. We assume that $\epsilon_t$ is a martingale
difference sequence adapted to the filtration $\{\mathcal{F}_t\}$, and independent of $x_0$, $d_0$, $f_0$ and $\omega_\tau$ for any
$\tau \ge 1$. Also, $E[\epsilon_t^2 \mid \mathcal{F}_{t-1}] = \Sigma_\epsilon \in \mathbb{R}$ being independent of $t$. Finally, each $\epsilon_t$ is assumed to be
sub-Gaussian, i.e., $E[\exp(a\epsilon_t) \mid \mathcal{F}_{t-1}] \le \exp(C_\epsilon^2 a^2/2)$, $\forall t \ge 1$, $\forall a \in \mathbb{R}$ for some $C_\epsilon > 0$.

Policy A policy is defined as a sequence $\pi = \{\pi_1, \pi_2, \ldots\}$ of functions where $\pi_t$ maps the trader's
information set at the beginning of period $t$ into an action $u_t$. The trader observes $f_{t-1}$ and $p_{t-1}$
at the end of period $t - 1$ and thus her information set at the beginning of period $t$ is given by
$\mathcal{I}_{t-1} = \{x_0, d_0, f_0, \ldots, f_{t-1}, p_0, \ldots, p_{t-1}\}$. A policy $\pi$ is admissible if $z_t = [x_t \;\; d_t^\top \;\; f_t^\top]^\top$ generated
by $u_t = \pi_t(\mathcal{I}_{t-1})$ satisfies $\lim_{T \to \infty} \|z_T\|^2 / T = 0$. A set of admissible policies is denoted by $\Pi$.
Objective Function The trader's objective is to maximize expected average "risk-adjusted"
profit defined as

$$\liminf_{T \to \infty} E\left[\frac{1}{T} \sum_{t=1}^{T} \left(\Delta p_t x_{t-1} - \rho \Sigma_\epsilon x_t^2\right)\right]$$

where the first term $\Delta p_t x_{t-1}$ indicates change in book value and the second term $\rho \Sigma_\epsilon x_t^2$ is a quadratic
penalty for her non-zero security position in the next period that reflects her risk aversion. Here $\rho$ is a
risk-aversion coefficient that quantifies the extent to which the trader is risk-averse.

Assumptions The following is a list of assumptions on which our analysis is based throughout
this paper. Let $\theta^* = [\lambda^* \;\; \gamma_1^* \cdots \gamma_M^*]^\top \in \mathbb{R}^{M+1}$. We will make two more assumptions as we progress.

Assumption 1. (a) The price impact coefficients $\theta^*$ are unknown to the trader. Note that they can
be learned only through executed trades.

(b) The factor loadings $g$ are known to the trader. This is a reasonable assumption since they can
be learned by observing prices without any transaction.

(c) The decaying rates $r = [r_1, \ldots, r_M]^\top \in [0, 1)^M$ of the transient price impact are known to the
trader and all the elements are distinct. In practice, they are not known a priori.
However, this can be handled effectively for practical purposes by using a sufficiently dense $r$ with
a large $M$ so that potential bias induced by modeling mismatch can be greatly reduced at the
expense of increased variance, which can be reduced by regularization.

(d) $\theta^* \in \Theta \triangleq \{\theta \in \mathbb{R}^{M+1} : 0 \le \theta \le \theta_{\max},\; \mathbf{1}^\top \theta \ge \beta\}$ for some $\theta_{\max} > 0$ component-wise and some
$\beta > 0$. The constraint $\mathbf{1}^\top \theta \ge \beta$ is imposed to capture non-zero execution costs in practice.
Note that $\Theta$ is compact and convex.

Notations $\|\cdot\|$ and $\|\cdot\|_F$ denote the $\ell_2$-norm and the Frobenius norm of a matrix, respectively.
$a \vee b$ and $a \wedge b$ denote $\max\{a, b\}$ and $\min\{a, b\}$, respectively. For a symmetric matrix $A$, $A \succ 0$
means that $A$ is positive definite and $A \succeq 0$ means that $A$ is positive semidefinite. $\lambda_{\min}(A)$ indicates
the smallest eigenvalue of $A$. $(A)_{ij}$ of a matrix $A$ indicates the entry of $A$ in the $i$th row and in the
$j$th column. $(v)_i$ of a vector $v$ indicates the $i$th entry of $v$. $\mathrm{diag}(v)$ of a vector $v$ denotes a diagonal
matrix whose $i$th diagonal entry is $(v)_i$. $A_{*,j}$ denotes the $j$th column of $A$ and $A_{i:j,k}$ indicates a
segment of the $k$th column of $A$ from the $i$th entry to the $j$th entry. $\mathbb{1}\{B\}$ denotes an indicator
function on the event $B$.

2.2 Existence of Optimal Solution 

Now, we will show that there exists an optimal policy among admissible policies that maximizes 
expected average risk-adjusted profit. For convenience, we will consider the following minimization 
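problem that is equivalent to maximizing expected average risk-adjusted profit.

$$\min_{\pi \in \Pi} \; \limsup_{T \to \infty} \; E\left[\frac{1}{T} \sum_{t=1}^{T} \left(\rho \Sigma_\epsilon x_t^2 - \Delta p_t x_{t-1}\right)\right]$$

We call the negative of average risk-adjusted profit "average cost." This problem can be expressed
as a discrete-time linear quadratic control problem

$$\min_{\pi \in \Pi} \; \limsup_{T \to \infty} \; E\left[\frac{1}{T} \sum_{t=1}^{T} \begin{bmatrix} z_{t-1} \\ u_t \end{bmatrix}^\top \begin{bmatrix} Q & S \\ S^\top & R \end{bmatrix} \begin{bmatrix} z_{t-1} \\ u_t \end{bmatrix}\right] \quad \text{s.t. } z_t = A z_{t-1} + B u_t + W_t, \;\; u_t = \pi_t(\mathcal{I}_{t-1}),$$

where $z_t = [x_t \;\; d_t^\top \;\; f_t^\top]^\top$, $v = [0 \;\; \gamma^{*\top}(\mathrm{diag}(r) - I) \;\; g^\top]^\top$, $\gamma^* = [\gamma_1^* \cdots \gamma_M^*]^\top$, $e_1 = [1 \;\; 0 \cdots 0]^\top$,

$$Q = \rho \Sigma_\epsilon e_1 e_1^\top - \tfrac{1}{2}(v e_1^\top + e_1 v^\top), \quad S = \rho \Sigma_\epsilon e_1 - \tfrac{1}{2}(\lambda^* + \gamma^{*\top} \mathbf{1}) e_1, \quad R = \rho \Sigma_\epsilon,$$

$$A = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \mathrm{diag}(r) & 0 \\ 0 & 0 & \Phi \end{bmatrix}, \quad B = \begin{bmatrix} 1 \\ \mathbf{1} \\ 0 \end{bmatrix}, \quad W_t = \begin{bmatrix} 0 \\ 0 \\ \omega_t \end{bmatrix}, \quad \Omega \triangleq \mathrm{Cov}[W_t] = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & \Sigma_\omega \end{bmatrix}.$$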
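
As a sanity check on the linear quadratic reformulation, the sketch below builds $A$, $B$, $Q$, $S$ and $R$ with plain Python lists and confirms numerically that the quadratic form reproduces the per-period cost $\rho\Sigma_\epsilon x_t^2 - \Delta p_t x_{t-1}$ when the noise term $\epsilon_t$ is set to zero. The helper names (`lq_matrices`, `stage_cost_quadratic`, `stage_cost_direct`) and the parameter values in any usage are assumptions made for illustration only.

```python
def lq_matrices(lam, gamma, r, g, Phi, rho, sigma_eps):
    """Build (A, B, Q, S, R) for the state z = [x, d_1..d_M, f_1..f_K]."""
    M, K = len(r), len(g)
    n = 1 + M + K
    A = [[0.0] * n for _ in range(n)]
    A[0][0] = 1.0
    for m in range(M):
        A[1 + m][1 + m] = r[m]
    for i in range(K):
        for j in range(K):
            A[1 + M + i][1 + M + j] = Phi[i][j]
    B = [1.0] * (1 + M) + [0.0] * K
    # v = [0, gamma^T (diag(r) - I), g^T]^T and e_1 = [1, 0, ..., 0]^T
    v = [0.0] + [gamma[m] * (r[m] - 1.0) for m in range(M)] + list(g)
    e1 = [1.0] + [0.0] * (n - 1)
    Q = [[rho * sigma_eps * e1[i] * e1[j] - 0.5 * (v[i] * e1[j] + e1[i] * v[j])
          for j in range(n)] for i in range(n)]
    S = [(rho * sigma_eps - 0.5 * (lam + sum(gamma))) * e1[i] for i in range(n)]
    R = rho * sigma_eps
    return A, B, Q, S, R


def stage_cost_quadratic(z, u, Q, S, R):
    """[z; u]^T [[Q, S], [S^T, R]] [z; u] for a scalar action u."""
    n = len(z)
    zQz = sum(z[i] * Q[i][j] * z[j] for i in range(n) for j in range(n))
    zSu = sum(z[i] * S[i] for i in range(n)) * u
    return zQz + 2.0 * zSu + R * u * u


def stage_cost_direct(z, u, lam, gamma, r, g, rho, sigma_eps):
    """rho * Sigma_eps * x_t^2 - dp_t * x_{t-1}, evaluated with eps_t = 0."""
    M = len(r)
    x, d, f = z[0], z[1:1 + M], z[1 + M:]
    dp = (sum(gi * fi for gi, fi in zip(g, f)) + lam * u
          + sum(gamma[m] * ((r[m] - 1.0) * d[m] + u) for m in range(M)))
    return rho * sigma_eps * (x + u) ** 2 - dp * x
```

The two cost functions should agree for any state and action; the remaining term $-x_{t-1}\epsilon_t$ has zero conditional mean and therefore does not affect the expected average cost.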



Note that R is strictly positive but Q is not necessarily positive semidefinite. Therefore, special care 
should be taken in order to prove the existence of an optimal policy. We start with a well-known 
Bellman equation for average-cost linear quadratic control problems 



$$H(z_{t-1}) + h = \min_{u_t} E\left[\rho \Sigma_\epsilon (x_{t-1} + u_t)^2 - \Delta p_t x_{t-1} + H(z_t)\right] \tag{2}$$

where $H(\cdot)$ denotes a differential value function and $h$ denotes minimum average cost. It is natural
to conjecture $H(z_t) = z_t^\top P z_t$. Plugging it into (2), we can obtain a discrete-time algebraic Riccati
equation

$$P = A^\top P A + Q - (S^\top + B^\top P A)^\top (R + B^\top P B)^{-1} (S^\top + B^\top P A) \tag{3}$$

with a second-order optimality condition $R + B^\top P B > 0$. The following theorem characterizes
an optimal policy among admissible policies that minimizes expected average cost, and proves 
existence and uniqueness of such an optimal policy. 

Theorem 1. For any $\theta^* \in \Theta$, there exists a unique symmetric solution $P$ to (3) that satisfies
$R + B^\top P B > 0$ and $\rho_{sr}(A + BL) < 1$ where $L = -(R + B^\top P B)^{-1}(S^\top + B^\top P A)$ and $\rho_{sr}(\cdot)$ denotes
a spectral radius. Moreover, a policy $\pi = (\pi_1, \pi_2, \ldots)$ with $\pi_t(\mathcal{I}_{t-1}) = L z_{t-1}$ is an optimal policy
among admissible policies that attains minimum expected average cost $\mathrm{tr}(P\Omega)$.

For ease of exposition, we define some notations: $P(\theta)$ denotes a unique symmetric stabilizing
solution to (3) with $\theta^* = \theta$. $L(\theta) = -(R + B^\top P(\theta) B)^{-1}(S(\theta)^\top + B^\top P(\theta) A)$ denotes a gain matrix
for an optimal policy with $\theta^* = \theta$, $G(\theta) = A + BL(\theta)$ denotes a closed-loop system matrix with
$\theta^* = \theta$, and $U(\theta) = \mathbf{1}L(\theta) + [\,\mathrm{diag}([1 \;\; r^\top]^\top) - I \;\;\; O\,]$ denotes a linear mapping from $z_{t-1}$ to a regressor $\psi_t$ used in
least-squares regression for learning price impact, i.e. $\psi_t = U(\theta) z_{t-1}$. Having these notations, we
make two assumptions about $L(\theta)$ as follows. Indeed, we can verify through closed-form solutions
that these assumptions hold in a special case which will be discussed in Subsection 2.3.

Assumption 2. (a) There exists $C_L > 0$ such that $\|L(\theta_1) - L(\theta_2)\| \le C_L \|\theta_1 - \theta_2\|$ for any $\theta_1, \theta_2 \in \Theta$.
(b) $(L(\theta))_1 \ne 0$ and $(L(\theta))_{M+2} \ne 0$ for any $\theta \in \Theta$.

Using Assumption 2, we can obtain an upper bound on $\|z_t\|$ uniformly over $\theta \in \Theta$ and $t \ge 0$.

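Lemma 1. For any $0 < \xi < 1$, there exists $N \in \mathbb{N}$ being independent of $\theta$ such that $\|G^N(\theta)\| \le \xi$ for
all $\theta \in \Theta$. Thus, $\max_{0 \le i \le N-1} \sup_{\theta \in \Theta} \|G^i(\theta)\| \triangleq C_g$ is finite. For any fixed $\theta \in \Theta$,
$\|z_t\| \le C_g \|z_0\| + C_g C_\omega / (\xi(1 - \xi^{1/N})) \triangleq C_z$, $\forall t \ge 0$ a.s. where $z_t = G(\theta) z_{t-1} + W_t$. Moreover, $\sup_{\theta \in \Theta} \|U(\theta)\| \le C_g + 1$.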

Note that Lemma 1 can be applied only when $\theta$ is fixed over time. From now on, we assume
$\|z_0\| \le 2 C_g C_\omega / (\xi(1 - \xi^{1/N}))$ without loss of generality; otherwise we can always set $C_g$ to be greater
than $\|z_0\| \, \xi (1 - \xi^{1/N}) / (2 C_\omega)$.

Figure 1: (Left) Relative error for $P_0^{(T)}$: $T = 300$ corresponds to 3.8 trading days. (Right) Relative
error for $L(\theta_t)$ from CTRACE: Period 3000 corresponds to 38 trading days. The vertical bars represent
two standard errors. In both figures, the simulation setting in Section 4 is used.

Finally, we present concrete numerical examples that support the validity of our price model
as an approximation of the geometric model for practical purposes. As we discussed earlier, our
numerical experiments conducted in Section 4 show that our infinite-horizon control problem could
be approximated accurately by a finite-horizon control problem with a time horizon on a few-week
time scale. To be more precise, we define relative error for $P_0^{(T)}$ as $\|P_0^{(T)} - P\| / \|P\|$ where $P_t^{(T)}$
denotes a coefficient matrix of a quadratic value function at period $t$ for a finite-horizon control
problem with a terminal period $T$, and $P$ denotes a coefficient matrix of a quadratic value function
for our infinite-horizon control problem. As shown in Figure 1, the relative error for $P_0^{(T)}$ appears
to decrease exponentially in $T$ and the relative error for $P_0^{(300)}$ is almost 10~^ where $T = 300$
corresponds to 3.8 trading days.

Furthermore, we could learn unknown $\theta^*$ fast enough to take actions that are close to optimal
actions on a required time scale. An action from a current estimate could be quite close to an
optimal action even if estimation error for the current estimate is large, especially in cases where a
few "principal components" of $L(\theta)$ with large directional derivatives with respect to $\theta$ are learned
accurately. To be more precise, we define relative error for $L(\theta_t)$ as

$$\frac{E\left[(L(\theta_t) z^*_{t-1} - L(\theta^*) z^*_{t-1})^2\right]}{E\left[(L(\theta^*) z^*_{t-1})^2\right]} = \frac{(L(\theta_t) - L(\theta^*)) \, \Pi_{zz}(\theta^*) \, (L(\theta_t) - L(\theta^*))^\top}{L(\theta^*) \, \Pi_{zz}(\theta^*) \, L(\theta^*)^\top}$$

where $z^*_t$ is a stationary process generated by $u^*_t = L(\theta^*) z^*_{t-1}$ and $\Pi_{zz}(\theta^*) = E[z^*_t z^{*\top}_t]$. The relative
error for $L(\theta_t)$ indicates how much an action from an estimate $\theta_t$ differs from an optimal action
from the true value $\theta^*$. Figure 1 shows how the relative error for $L(\theta_t)$ evolves over time with
two-standard-error bars when the $\theta_t$'s are obtained from a new policy that we will propose in Section
3. As can be seen, all the approximate 95%-confidence intervals lie within a ±3% range after Period
2500, which corresponds to 32 trading days. It implies that actions from estimates learned over a few
weeks could be sufficiently close to optimal actions.

2.3 Closed-Form Solution: A Single Factor and Permanent Impact Only 

When we consider only the permanent price impact and a single observable factor, we can derive 
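an exact closed-form $P$ and $L$ as follows.

$$P_{xx} = \frac{\lambda^* - \rho\Sigma_\epsilon + \sqrt{2\lambda^*\rho\Sigma_\epsilon + (\rho\Sigma_\epsilon)^2}}{2}, \qquad P_{xf} = -\frac{g\lambda^*}{2\left((1 - \Phi)\lambda^* - \Phi\rho\Sigma_\epsilon + \Phi\sqrt{2\lambda^*\rho\Sigma_\epsilon + (\rho\Sigma_\epsilon)^2}\right)},$$

$$P_{ff} = -\frac{g^2\Phi^2}{2(1 - \Phi^2)\left((1 - \Phi)^2\lambda^* + (1 + \Phi^2)\rho\Sigma_\epsilon + (1 - \Phi^2)\sqrt{2\lambda^*\rho\Sigma_\epsilon + (\rho\Sigma_\epsilon)^2}\right)},$$

$$L_x = \frac{-2\rho\Sigma_\epsilon}{\rho\Sigma_\epsilon + \sqrt{2\lambda^*\rho\Sigma_\epsilon + (\rho\Sigma_\epsilon)^2}}, \qquad L_f = \frac{g\Phi}{(1 - \Phi)\lambda^* + \rho\Sigma_\epsilon + \sqrt{2\lambda^*\rho\Sigma_\epsilon + (\rho\Sigma_\epsilon)^2}}.$$

Although this is a special case of our general setting, we can get useful insights into the effect of
permanent price impact coefficient $\lambda^*$ on various quantities. Here are some examples:

• $|L_x|$ and $|L_f|$ are strictly decreasing in $\lambda^*$.

• $\lim_{\lambda^* \to 0} L_x = -1$, $\lim_{\lambda^* \to \infty} L_x = 0$.

• $\lim_{\lambda^* \to 0} L_f = g\Phi/(2\rho\Sigma_\epsilon)$, $\lim_{\lambda^* \to \infty} L_f = 0$.

• The expected average risk-adjusted profit $-P_{ff}\Sigma_\omega$ is strictly decreasing in $\lambda^*$.

• $\lim_{\lambda^* \to 0} (-P_{ff}\Sigma_\omega) = g^2\Phi^2\Sigma_\omega/(4(1 - \Phi^2)\rho\Sigma_\epsilon)$, $\lim_{\lambda^* \to \infty} (-P_{ff}\Sigma_\omega) = 0$.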
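
The closed-form expressions can be cross-checked against a direct fixed-point iteration of the Riccati equation (3) in this two-dimensional special case. The sketch below is only an illustration: the function names are hypothetical, and the naive iteration is started from $P = 0$, which is assumed (not proven here) to converge for moderate parameter values such as $\lambda^* < 4\rho\Sigma_\epsilon$; it is not a general-purpose Riccati solver.

```python
import math


def closed_form(lam, g, Phi, rho, sigma_eps):
    """Closed-form (P_xx, P_xf, P_ff) and (L_x, L_f): permanent impact, one factor."""
    a = rho * sigma_eps
    s = math.sqrt(2.0 * lam * a + a * a)
    P_xx = (lam - a + s) / 2.0
    P_xf = -g * lam / (2.0 * ((1.0 - Phi) * lam - Phi * a + Phi * s))
    P_ff = -(g * Phi) ** 2 / (
        2.0 * (1.0 - Phi ** 2)
        * ((1.0 - Phi) ** 2 * lam + (1.0 + Phi ** 2) * a + (1.0 - Phi ** 2) * s))
    L_x = -2.0 * a / (a + s)
    L_f = g * Phi / ((1.0 - Phi) * lam + a + s)
    return (P_xx, P_xf, P_ff), (L_x, L_f)


def riccati_iteration(lam, g, Phi, rho, sigma_eps, iters=2000):
    """Iterate (3) with A = diag(1, Phi), B = [1, 0]^T, starting from P = 0."""
    a = rho * sigma_eps                  # R, and also the (x, x) entry of Q
    q_xf = -0.5 * g                      # (x, f) entry of Q; the (f, f) entry is 0
    s_x = a - 0.5 * lam                  # first entry of S; the second entry is 0
    P_xx = P_xf = P_ff = 0.0
    for _ in range(iters):
        m = a + P_xx                     # R + B^T P B
        n_x = s_x + P_xx                 # first entry of S^T + B^T P A
        n_f = Phi * P_xf                 # second entry of S^T + B^T P A
        P_xx, P_xf, P_ff = (P_xx + a - n_x * n_x / m,
                            Phi * P_xf + q_xf - n_x * n_f / m,
                            Phi * Phi * P_ff - n_f * n_f / m)
    m = a + P_xx
    L = (-(s_x + P_xx) / m, -Phi * P_xf / m)
    return (P_xx, P_xf, P_ff), L
```

For example, calling both functions with `lam=0.5, g=1.0, Phi=0.9, rho=1.0, sigma_eps=1.0` should return matching values up to floating-point error, and evaluating `closed_form` at a very small `lam` illustrates the limits listed above.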
2.4 Performance Measure: Regret

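In this subsection, we define a performance measure that can be used to evaluate policies. For
notational simplicity, let $L^* = L(\theta^*)$, $G^* = G(\theta^*)$ and $P^* = P(\theta^*)$. Using (3), we can show that

$$J_T^\pi(z_0 \mid \mathcal{F}_T) \triangleq \sum_{t=1}^{T} \left(\rho\Sigma_\epsilon x_t^2 - \Delta p_t x_{t-1}\right) = z_0^\top P^* z_0 - z_T^\top P^* z_T + 2\sum_{t=1}^{T} (A z_{t-1} + B\pi_t(\mathcal{I}_{t-1}))^\top P^* W_t + \sum_{t=1}^{T} W_t^\top P^* W_t - \sum_{t=1}^{T} x_{t-1}\epsilon_t$$
$$\qquad + \sum_{t=1}^{T} (\pi_t(\mathcal{I}_{t-1}) - L^* z_{t-1})^\top (R + B^\top P^* B)(\pi_t(\mathcal{I}_{t-1}) - L^* z_{t-1}) \quad \text{for any policy } \pi.$$

First, we define pathwise regret $R_T^\pi(z_0 \mid \mathcal{F}_T)$ of a policy $\pi$ at period $T$ as $J_T^\pi(z_0 \mid \mathcal{F}_T) - J_T^{\pi^*}(z_0 \mid \mathcal{F}_T)$
where $\pi_t^*(\mathcal{I}_{t-1}) = L^* z^*_{t-1}$ and $z^*_t = G^* z^*_{t-1} + W_t$ with $z^*_0 = z_0$. In other words, the pathwise regret
of a policy $\pi$ at period $T$ amounts to excess costs accumulated over $T$ periods when applying $\pi$
relative to when applying the optimal policy $\pi^*$. By definition of $\pi^*$, the pathwise regret of a policy
$\pi$ at period $T$ can be expressed as

$$R_T^\pi(z_0 \mid \mathcal{F}_T) = z_T^{*\top} P^* z^*_T - z_T^\top P^* z_T + \sum_{t=1}^{T} (\pi_t(\mathcal{I}_{t-1}) - L^* z_{t-1})^\top (R + B^\top P^* B)(\pi_t(\mathcal{I}_{t-1}) - L^* z_{t-1})$$
$$\qquad + 2\sum_{t=1}^{T} \left((A z_{t-1} + B\pi_t(\mathcal{I}_{t-1})) - (A + BL^*) z^*_{t-1}\right)^\top P^* W_t + \sum_{t=1}^{T} (x^*_{t-1} - x_{t-1})\epsilon_t.$$

Second, we define expected regret $\bar{R}_T^\pi(z_0)$ of a policy $\pi$ at period $T$ as $E[R_T^\pi(z_0 \mid \mathcal{F}_T)]$. Taking the
expectation of pathwise regret, we can obtain a more concise expression for expected regret because the
last two terms vanish by the law of total expectation. Hence, we have

$$\bar{R}_T^\pi(z_0) = E\left[z_T^{*\top} P^* z^*_T - z_T^\top P^* z_T\right] + E\left[\sum_{t=1}^{T} (\pi_t(\mathcal{I}_{t-1}) - L^* z_{t-1})^\top (R + B^\top P^* B)(\pi_t(\mathcal{I}_{t-1}) - L^* z_{t-1})\right].$$

Finally, we define relative regret $\tilde{R}_T^\pi(z_0)$ of a policy $\pi$ at period $T$ as $\bar{R}_T^\pi(z_0) / |\mathrm{tr}(P^*\Omega)|$ where $\mathrm{tr}(P^*\Omega)$
is minimum expected average cost for $\theta^*$. Our choice of performance measure will be either expected
regret or relative regret in the rest of this paper.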
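
The pathwise regret decomposition can be verified numerically in the single-factor, permanent-impact-only case, where $P^*$ and $L^*$ are available in closed form. The sketch below is a hedged illustration: it assumes uniform factor shocks and Gaussian noise, a fixed linear-gain policy `L_pi`, and hypothetical function names; it compares the regret computed directly as a cost difference with the decomposed expression on a common noise path.

```python
import math
import random


def closed_form(lam, g, Phi, rho, sigma_eps):
    """P = (P_xx, P_xf, P_ff) and L = (L_x, L_f): permanent impact, one factor."""
    a = rho * sigma_eps
    s = math.sqrt(2.0 * lam * a + a * a)
    P = ((lam - a + s) / 2.0,
         -g * lam / (2.0 * ((1.0 - Phi) * lam - Phi * a + Phi * s)),
         -(g * Phi) ** 2 / (2.0 * (1.0 - Phi ** 2)
                            * ((1.0 - Phi) ** 2 * lam + (1.0 + Phi ** 2) * a
                               + (1.0 - Phi ** 2) * s)))
    L = (-2.0 * a / (a + s), g * Phi / ((1.0 - Phi) * lam + a + s))
    return P, L


def quad(P, x, f):
    """z^T P z for z = [x, f] and symmetric P given as (P_xx, P_xf, P_ff)."""
    return P[0] * x * x + 2.0 * P[1] * x * f + P[2] * f * f


def pathwise_regret(L_pi, T, lam, g, Phi, rho, sigma_eps, seed=0):
    """Return (direct cost difference, decomposed expression) for u_t = L_pi z_{t-1}."""
    rng = random.Random(seed)
    a = rho * sigma_eps
    P, L_star = closed_form(lam, g, Phi, rho, sigma_eps)
    x = x_star = 0.0                      # z_0 is shared by both policies
    f = 0.5
    cost = cost_star = 0.0
    quad_term = w_term = eps_term = 0.0
    for _ in range(T):
        omega = rng.uniform(-1.0, 1.0)    # bounded factor shock (assumed distribution)
        eps = rng.gauss(0.0, math.sqrt(sigma_eps))
        u = L_pi[0] * x + L_pi[1] * f
        u_star = L_star[0] * x_star + L_star[1] * f
        cost += a * (x + u) ** 2 - (g * f + lam * u + eps) * x
        cost_star += a * (x_star + u_star) ** 2 - (g * f + lam * u_star + eps) * x_star
        dev = u - (L_star[0] * x + L_star[1] * f)
        quad_term += dev * dev * (a + P[0])               # R + B^T P* B = a + P_xx
        w_term += 2.0 * ((x + u) - (x_star + u_star)) * P[1] * omega
        eps_term += (x_star - x) * eps
        x, x_star, f = x + u, x_star + u_star, Phi * f + omega
    direct = cost - cost_star
    decomposed = quad(P, x_star, f) - quad(P, x, f) + quad_term + w_term + eps_term
    return direct, decomposed
```

Passing the optimal gain itself as `L_pi` makes both trajectories coincide, so both returned values are zero; a perturbed gain gives two values that should agree up to rounding.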
3 Confidence-Triggered Regularized Adaptive Certainty Equivalent
Policy

Our problem can be viewed as a special case of reinforcement learning, which focuses on sequential 
decision-making problems in which unknown properties of an environment must be learned in 
the course of taking actions. It is often emphasized in reinforcement learning that longer-term 
performance can be greatly improved by making decisions that explore the environment efficiently 
at the expense of suboptimal short-term behavior. In our problem, a price impact model is unknown, 
and submission of large orders can be considered exploratory actions that facilitate learning. 

Certainty equivalent control (CE) represents one extreme where at any time, current point 
estimates are assumed to be correct and actions are made accordingly. Although learning is carried 
out with observations made as the system evolves, no decisions are designed to enhance learning. 
Thus, this is an instance of pure exploitation of current knowledge. In our problem, CE estimates 
the unknown price impact coefficients $\theta^*$ at each period via least-squares regression using available
data, and makes decisions that maximize expected average risk-adjusted profit under an assumption
that the estimated model is correct. That is, an action $u_t$ for CE is given by $u_t = L(\theta_{t-1}) z_{t-1}$ where
$\theta_{t-1} = \operatorname{argmin}_{\theta \in \Theta} \sum_{i=1}^{t-1} \left((\Delta p_i - g^\top f_{i-1}) - \psi_i^\top \theta\right)^2$ with a regressor $\psi_i = [u_i \;\; (d_i - d_{i-1})^\top]^\top$.
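
To make the CE estimation step concrete, here is a minimal sketch for the simplest case of permanent impact only, where $\theta$ is a scalar, the regressor reduces to $\psi_i = u_i$, and $\Theta$ is the interval $[\beta, \theta_{\max}]$. In one dimension the constrained least-squares solution is the unconstrained one clipped to the interval. The function name `ce_estimate` and the fallback used when no trade has occurred are assumptions of this illustration.

```python
def ce_estimate(dp, alpha, psi, beta, theta_max):
    """Scalar least-squares estimate of price impact, projected onto [beta, theta_max].

    dp[i]    : observed price change in period i
    alpha[i] : predictable part g^T f_{i-1} of that price change
    psi[i]   : regressor, here simply the trade u_i
    """
    num = sum(p * (y - a) for p, y, a in zip(psi, dp, alpha))
    den = sum(p * p for p in psi)
    if den == 0.0:
        # No excitation yet: every theta minimizes the objective, so returning
        # beta is an arbitrary (assumed) tie-breaking choice.
        return beta
    return min(max(num / den, beta), theta_max)
```

With noise-free data generated from a true coefficient inside the interval, the function recovers that coefficient; estimates that fall outside the interval are clipped to its boundary.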

An important question is how aggressively the trader should explore to learn $\theta^*$. Unlike many
other reinforcement learning problems, a fairly large amount of exploration is naturally induced 
by exploitative decisions in our problem. That is, regular trading activity triggered by the return- 
predictive factors ft excites the market regardless of whether or not she aims to learn price impact. 
Given sufficiently large factor variability, the induced exploration might adequately resolve uncer- 
tainties about price impact. However, we will demonstrate by proposing a new exploratory policy 
that executing trades to explore beyond what would naturally occur through the factor-driven 
exploitation can result in significant benefit. 

Now, let us formally state that exploitative actions triggered by the return-predictive factors 
induce a large degree of exploration that could yield strong consistency of least-squares estimates. 
It is worth noting that pure exploitation is not sufficient for strong consistency in other problems
such as Lai and Wei [1986] and Chen and Guo [1986].



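Lemma 2. For any $\theta \in \Theta$, let $u_t = L(\theta) z_{t-1}$, $z_t = G(\theta) z_{t-1} + W_t$ and $\psi_t^\top = [u_t \;\; (d_t - d_{t-1})^\top]
= (U(\theta) z_{t-1})^\top$. Also, let $\Pi_{zz}(\theta)$ denote a unique solution to $\Pi_{zz}(\theta) = G(\theta)\Pi_{zz}(\theta)G(\theta)^\top + \Omega$. Then,

$$\lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \psi_t \psi_t^\top = U(\theta)\Pi_{zz}(\theta)U(\theta)^\top \succ 0 \quad \text{a.s.} \tag{4}$$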
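
The limit in (4) can be illustrated numerically in the permanent-impact-only case with a single factor, where the regressor is the scalar $\psi_t = u_t$ and $U(\theta)$ reduces to the gain row $L$. The sketch below assumes an arbitrary stabilizing gain rather than the optimal one, uniform factor shocks, and hypothetical function names; it solves the Lyapunov equation by fixed-point iteration and compares $L \Pi_{zz} L^\top$ with a simulated time average of $u_t^2$.

```python
import random


def stationary_cov(G, sigma_omega, iters=2000):
    """Solve Pi = G Pi G^T + Omega by iteration for upper-triangular 2x2 G.

    Omega has sigma_omega in the (f, f) entry and zeros elsewhere.
    Returns (Pi_xx, Pi_xf, Pi_ff).
    """
    (a11, a12), (_, a22) = G
    p11 = p12 = p22 = 0.0
    for _ in range(iters):
        p11, p12, p22 = (a11 * a11 * p11 + 2.0 * a11 * a12 * p12 + a12 * a12 * p22,
                         a22 * (a11 * p12 + a12 * p22),
                         a22 * a22 * p22 + sigma_omega)
    return p11, p12, p22


def limit_and_sample(L, Phi, T, seed=0):
    """Return (L Pi_zz L^T, time average of u_t^2) for the gain L = (L_x, L_f)."""
    c = 1.0                                   # omega_t ~ Uniform(-c, c), variance c^2 / 3
    G = ((1.0 + L[0], L[1]), (0.0, Phi))      # closed-loop matrix A + B L
    p11, p12, p22 = stationary_cov(G, c * c / 3.0)
    limit = L[0] * L[0] * p11 + 2.0 * L[0] * L[1] * p12 + L[1] * L[1] * p22
    rng = random.Random(seed)
    x = f = 0.0
    acc = 0.0
    for _ in range(T):
        u = L[0] * x + L[1] * f
        acc += u * u
        x, f = x + u, Phi * f + rng.uniform(-c, c)
    return limit, acc / T
```

A strictly positive limit means that trading driven by the factor alone keeps exciting the market, which is the source of the "free" exploration discussed above.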



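Moreover, we can show that $\Pi_{zz}(\theta)$ is continuous on $\Theta$ by proving uniform convergence of
$E\left[\frac{1}{T}\sum_{t=1}^{T} z_{t-1} z_{t-1}^\top\right]$ to $\Pi_{zz}(\theta)$ on $\Theta$. Continuity leads to $\lambda^*_{\psi\psi} \triangleq \inf_{\theta \in \Theta} \lambda_{\min}\left(U(\theta)\Pi_{zz}(\theta)U(\theta)^\top\right) > 0$,
which will be used later.

Corollary 1. $\Pi_{zz}(\theta)$ is continuous on $\Theta$ and $\lambda^*_{\psi\psi} = \inf_{\theta \in \Theta} \lambda_{\min}\left(U(\theta)\Pi_{zz}(\theta)U(\theta)^\top\right) > 0$.

Lemma 2 implies that $\lambda_{\min}\left(\sum_{t=1}^{T} \psi_t \psi_t^\top\right)$ increases linearly in time $T$ a.s. asymptotically. In
addition, we can obtain a similar result for a finite-sample case: There exists a finite, deterministic
constant $T_1(\theta, \delta)$ such that $\lambda_{\min}\left(\sum_{t=1}^{T} \psi_t \psi_t^\top\right)$ grows linearly in time $T$ for all $T \ge T_1(\theta, \delta)$ with
probability at least $1 - \delta$. This is a crucial result that will be used for bounding above
"$(\epsilon, \delta)$-convergence time" later. It is formally stated in the following lemma.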

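Lemma 3. For any $\theta \in \Theta$, let $u_t = L(\theta) z_{t-1}$, $z_t = G(\theta) z_{t-1} + W_t$ and $\psi_t^\top = [u_t \;\; (d_t - d_{t-1})^\top]
= (U(\theta) z_{t-1})^\top$. Then, there exists an event $B(\delta)$ with $\Pr(B(\delta)) \ge 1 - \delta$ such that on $B(\delta)$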

lu{9)Uzz{9)U{9y ^^Y.^ti'J ^ ^U{9)Uzz{9)U{9)'^ VT > Ti{9,6) where 

t=i 



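Furthermore, we can extend Lemma 2 in such a way that $\lambda_{\min}\left(\sum_{t=1}^{T} \psi_t \psi_t^\top\right)$ still increases to
infinity linearly in time $T$ for time-varying $\{\theta_t\}$ adapted to $\{\sigma(\mathcal{I}_t)\}$ as long as $\theta_t$ remains sufficiently
close to a fixed $\theta \in \Theta$ for all $t \ge 0$. Here, $\sigma(\mathcal{I}_t)$ denotes a $\sigma$-algebra generated by $\mathcal{I}_t$ and $\theta_t$ is
$\sigma(\mathcal{I}_t)$-measurable for each $t$.

Lemma 4. Consider any $\theta \in \Theta$ and $\{\theta_t \in \Theta\}$ adapted to $\{\sigma(\mathcal{I}_t)\}$ such that $\|\theta_t - \theta\| \le \eta$ a.s.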

(u%l-uTrfXrain{Ilzz{9))u%l-uTrfXrnin{U{9)Uzz{9)U{9y)^ 

where ri = tj—^ A A irr-r 

\ 42NC^+^C^ 42NC^+^C^{l + \\Ui9W NC^~^ ) 

for all $t \ge 0$ and any $\nu \in (\xi, 1)$. Let $u_t = L(\theta_{t-1}) z_{t-1}$, $z_t = G(\theta_{t-1}) z_{t-1} + W_t$ and
$\psi_t^\top = [u_t \;\; (d_t - d_{t-1})^\top] = (U(\theta_{t-1}) z_{t-1})^\top$. Then,

Similarly to Lemma 3, we can obtain a finite-sample result for Lemma 4. This result provides
useful insight into how our new exploratory policy operates in the long term.

Lemma 5. Consider $\{\theta_t \in \Theta\}$ defined in Lemma 4. Let $u_t = L(\theta_{t-1}) z_{t-1}$, $z_t = G(\theta_{t-1}) z_{t-1} + W_t$
and $\psi_t^\top = [u_t \;\; (d_t - d_{t-1})^\top] = (U(\theta_{t-1}) z_{t-1})^\top$. Then, for any $0 < \delta < 1$, on the event $B(\delta)$ in
Lemma 3 with $\Pr(B(\delta)) \ge 1 - \delta$



A 



mm 



It is challenging to guarantee that all estimates generated by CE are sufficiently close to one another uniformly over time so that Lemma 4 and Lemma 5 can be applied to CE. In particular, CE is subject to overestimation of price impact, which can be considerably detrimental to trading performance. The reason is that overestimated price impact discourages submission of large orders, so it may take a long time for the trader to realize that price impact is overestimated because of the reduced "signal-to-noise ratio." To address this issue, we propose the confidence-triggered regularized adaptive certainty equivalent policy (CTRACE) as presented in Algorithm 1. CTRACE can be viewed as a generalization of CE and deviates from CE in two ways: (1) $\ell_2$ regularization is



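Algorithm 1 CTRACE

Input: $\theta_0$, $x_0$, $d_0$, $r$, $g$, $\kappa$, $C_v$, $\tau$, $L(\cdot)$, $\theta_{\max}$, $\{p_t\}_{t=0}^{\infty}$, $\{f_t\}_{t=0}^{\infty}$
Output: $\{u_t\}_{t=1}^{\infty}$

 1: $V_0 \leftarrow \kappa I$, $t_0 \leftarrow 0$, $i \leftarrow 1$
 2: for $t = 1, 2, \ldots$ do
 3:     $u_t \leftarrow L(\theta_{t-1})z_{t-1}$, $x_t \leftarrow x_{t-1} + u_t$, $d_t \leftarrow \mathrm{diag}(r)(d_{t-1} + \mathbf{1}u_t)$
 4:     $\psi_t \leftarrow [u_t \;\; (d_t - d_{t-1})^\top]^\top$, $V_t \leftarrow V_{t-1} + \psi_t\psi_t^\top$
 5:     if $\lambda_{\min}(V_t) \ge \kappa + C_v t$ and $t \ge t_{i-1} + \tau$ then
 6:         $\theta_t \leftarrow \arg\min_{\theta \in \Theta} \sum_{j=1}^{t}\big((\Delta p_j - g^\top f_{j-1}) - \psi_j^\top\theta\big)^2 + \kappa\|\theta\|^2$, $t_i \leftarrow t$, $i \leftarrow i + 1$
 7:     else
 8:         $\theta_t \leftarrow \theta_{t-1}$
 9:     end if
10: end for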
applied in least-squares regression, (2) coefficients are only updated when a certain measure of confidence exceeds a pre-specified threshold and a minimum inter-update time has elapsed. Note that CTRACE reduces to CE as the regularization penalty $\kappa$ and the threshold $C_v$ tend to zero, and the minimum inter-update time $\tau$ tends to one.
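
The update rule in Algorithm 1 can be sketched in a few lines of Python. This is only an illustrative sketch under simplifying assumptions that are not part of the paper: the regressors $\psi_t$ and responses $y_t = \Delta p_t - g^\top f_{t-1}$ are taken from a pre-recorded path (in CTRACE itself, $\psi_t$ depends on $\theta_{t-1}$ through the control $u_t$), $\Theta$ is assumed to be the box $[0, \theta_{\max}]$ in every coordinate, and clipping the ridge solution is used as a crude stand-in for the constrained argmin. The function name `ctrace_estimates` and its arguments are hypothetical.

```python
import numpy as np

def ctrace_estimates(psi, y, kappa, c_v, tau, theta0, theta_max):
    """Confidence-triggered ridge updates over a recorded path.

    psi: (T, d) regressors; y: (T,) responses (price change minus factor term).
    Returns the per-period estimates and the list of update times.
    """
    T, d = psi.shape
    V = kappa * np.eye(d)                  # V_0 = kappa * I
    b = np.zeros(d)                        # running sum of psi_t * y_t
    theta = np.asarray(theta0, dtype=float).copy()
    last_update = 0                        # t_0 = 0
    thetas, update_times = [], []
    for t in range(1, T + 1):
        p = psi[t - 1]
        V += np.outer(p, p)
        b += p * y[t - 1]
        # Confidence trigger: smallest eigenvalue of V_t must reach kappa + C_v * t.
        confident = np.linalg.eigvalsh(V)[0] >= kappa + c_v * t
        if confident and t >= last_update + tau:
            ridge = np.linalg.solve(V, b)  # unconstrained regularized least squares
            # Assumption: Theta is a box; clipping only approximates the constrained argmin.
            theta = np.clip(ridge, 0.0, theta_max)
            last_update = t
            update_times.append(t)
        thetas.append(theta.copy())        # otherwise keep the previous estimate
    return np.array(thetas), update_times
```

Setting `kappa` and `c_v` close to zero and `tau = 1` makes the sketch update at essentially every period, which mirrors the way CTRACE reduces to CE.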

Regularization induces active exploration in our problem by penalizing the $\ell_2$-norm of price impact coefficients, and it also reduces the variance of the estimator. Without regularization, we are more likely to obtain overestimates of price impact. Such an outcome attenuates trading intensity and thereby makes it difficult to escape from the misjudged perspective on price impact. Regularization decreases the chances of obtaining overestimates by reducing the variance of the estimator and, furthermore, tends to yield underestimates that encourage active exploration.
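
The shrinkage effect described above can be seen in a small numerical sketch. The data below are synthetic and purely hypothetical (two coefficients, weak regressors, unit-variance noise); the point is only that the regularized least-squares estimate never has a larger $\ell_2$-norm than the unregularized one, so it leans toward underestimating rather than overestimating the coefficients.

```python
import numpy as np

def ridge(X, y, kappa):
    """Regularized least squares: argmin ||y - X theta||^2 + kappa * ||theta||^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + kappa * np.eye(d), X.T @ y)

rng = np.random.default_rng(1)
theta_true = np.array([2.0, 1.0])        # hypothetical impact coefficients
X = 0.1 * rng.normal(size=(50, 2))       # small trades give a weak signal
y = X @ theta_true + rng.normal(size=50) # noisy price changes

ols = ridge(X, y, 0.0)                   # no regularization
reg = ridge(X, y, 1.0)                   # kappa = 1
print(np.linalg.norm(ols), np.linalg.norm(reg))
```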

Another source of improvement of CTRACE relative to CE is that updates are made based on a certain measure of confidence for estimates, whereas CE updates at every period regardless of confidence. To be more precise about this confidence measure, we first present a high-probability confidence region for least-squares estimates from Abbasi-Yadkori et al. [2011].



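Proposition 1 (Corollary 10 of Abbasi-Yadkori et al. [2011]). $\Pr(\theta^* \in S_t(\delta), \ \forall t \ge 1) \ge 1 - \delta$ where

$$V_t = \kappa I + \sum_{i=1}^{t}\psi_i\psi_i^\top, \qquad \hat\theta_t = V_t^{-1}\Big(\sum_{i=1}^{t}\psi_i\psi_i^\top\theta^* + \sum_{i=1}^{t}\psi_i\epsilon_i\Big),$$

$$S_t(\delta) \triangleq \Big\{\theta \in \mathbb{R}^{M+1} : (\theta - \hat\theta_t)^\top V_t(\theta - \hat\theta_t) \le \Big(C_\epsilon\sqrt{2\log\big(\det(V_t)^{1/2}\det(\kappa I)^{-1/2}/\delta\big)} + \kappa^{1/2}\|\theta_{\max}\|\Big)^2\Big\}.$$

This implies that for any $\theta \in S_t(\delta)$

$$\|\theta - \hat\theta_t\|^2 \le \frac{1}{\lambda_{\min}(V_t)}\Big(C_\epsilon\sqrt{2\log\big(\det(V_t)^{1/2}\det(\kappa I)^{-1/2}/\delta\big)} + \kappa^{1/2}\|\theta_{\max}\|\Big)^2.$$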



By definition, CTRACE updates only when $\lambda_{\min}(V_t) \ge \kappa + C_v t$. $\lambda_{\min}(V_t)$ typically dominates $\log(\det(V_t))$ for large $t$ because it increases linearly in $t$, and the bound on the squared estimation error $\|\theta - \hat\theta_t\|^2$ is inversely proportional to it. That is, CTRACE updates only when confidence represented by $\lambda_{\min}(V_t)$ exceeds the specified level $\kappa + C_v t$. From now on, we refer to this updating scheme as confidence-triggered update. Confidence-triggered update makes a significant contribution to reducing the chances of obtaining overestimates of price impact by updating "carefully" only at the moments when an upper bound on the estimation error is guaranteed to decrease.
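
To make the role of $\lambda_{\min}(V_t)$ concrete, the following sketch evaluates a confidence radius and the resulting error bound. It assumes the standard self-normalized ellipsoid form of Abbasi-Yadkori et al. [2011], namely radius $C_\epsilon\sqrt{2\log(\det(V_t)^{1/2}\det(\kappa I)^{-1/2}/\delta)} + \kappa^{1/2}\|\theta_{\max}\|$, and divides it by $\sqrt{\lambda_{\min}(V_t)}$. The function names and the numerical values are hypothetical.

```python
import numpy as np

def confidence_radius(V, kappa, delta, c_eps, theta_max_norm):
    """Radius of the confidence ellipsoid in the V-weighted norm (assumed form)."""
    d = V.shape[0]
    _, logdet_V = np.linalg.slogdet(V)
    # log( det(V)^{1/2} * det(kappa I)^{-1/2} )
    log_ratio = 0.5 * logdet_V - 0.5 * d * np.log(kappa)
    return c_eps * np.sqrt(2.0 * (log_ratio + np.log(1.0 / delta))) \
        + np.sqrt(kappa) * theta_max_norm

def error_bound(V, kappa, delta, c_eps, theta_max_norm):
    """Euclidean error bound: radius / sqrt(lambda_min(V))."""
    lam_min = np.linalg.eigvalsh(V)[0]
    return confidence_radius(V, kappa, delta, c_eps, theta_max_norm) / np.sqrt(lam_min)

# Hypothetical example: V_t = kappa*I + t*I grows linearly in every direction.
kappa, delta = 1.0, 0.05
bounds = [error_bound((kappa + t) * np.eye(3), kappa, delta, 1.0, 1.0)
          for t in (10, 100, 1000)]
print(bounds)
```

The radius grows only logarithmically through the determinant, while $\lambda_{\min}(V_t)$ grows linearly in this example, so the bound shrinks as $t$ increases.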

The minimum inter-update time $\tau \in \mathbb{N}$ in Algorithm 1 can guarantee that the closed-loop system $\{z_t\}$ from CTRACE is stable as long as $\tau$ is sufficiently large. Meanwhile, there is no such stability guarantee for CE. The following lemma provides a specific uniform bound on $\|z_t\|$.

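Lemma 6. Under CTRACE with $\tau \ge N\log(2C_g/\xi)/\log(1/\xi)$,

$$\|z_t\| \le \frac{(2C_g + 1)C_g C_\omega}{\xi(1 - \xi^{1/N})} \triangleq C_z^* \ \text{a.s.} \quad \text{and} \quad \|\psi_t\| \le \frac{(C_g + 1)(2C_g + 1)C_g C_\omega}{\xi(1 - \xi^{1/N})} \triangleq C_\psi^* \ \text{a.s.} \quad \forall t \ge 0.$$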
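
The condition on the minimum inter-update time can be read as a contraction requirement: $\tau \ge N\log(2C_g/\xi)/\log(1/\xi)$ is equivalent to $C_g\,\xi^{\tau/N - 1} \le 1/2$, so the state norm at least halves (up to the additive noise term) between consecutive updates. The sketch below checks this equivalence numerically; the values of $N$, $C_g$ and $\xi$ are hypothetical and the function names are illustrative only.

```python
import math

def min_inter_update_time(n, c_g, xi):
    """Smallest integer tau with tau >= N * log(2*C_g/xi) / log(1/xi)."""
    return math.ceil(n * math.log(2.0 * c_g / xi) / math.log(1.0 / xi))

def contraction_factor(tau, n, c_g, xi):
    """Factor C_g * xi^(tau/N - 1) multiplying the state norm over tau periods."""
    return c_g * xi ** (tau / n - 1.0)

n, c_g, xi = 4, 3.0, 0.5                 # hypothetical constants with 0 < xi < 1
tau = min_inter_update_time(n, c_g, xi)
print(tau, contraction_factor(tau, n, c_g, xi))
```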

Confidence-triggered update yields a good property of CTRACE that CE lacks: CTRACE is inter-temporally consistent in the sense that estimation errors $\|\theta_t - \theta^*\|$ are bounded with high probability by monotonically nonincreasing upper bounds that converge to zero almost surely as time tends to infinity. The following theorem formally states this property.

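Theorem 2 (Inter-temporal Consistency of CTRACE). Let $\{\theta_t\}$ be estimates generated by CTRACE with $M \ge 2$, $\tau \ge N\log(2C_g/\xi)/\log(1/\xi)$ and $C_v < \lambda^*_{\psi\psi}$. Then, the $i$th update time $t_i$ in Algorithm 1 is finite a.s. Moreover, $\|\theta_t - \theta^*\| \le b_t$, $\forall t \ge 0$ on the event $\{\theta^* \in S_t(\delta), \ \forall t \ge 1\}$ where

$$b_t = \begin{cases} \dfrac{2C_\epsilon\sqrt{(M+1)\log\big(C_\psi^{*2}\,t/\kappa + M + 1\big) + 2\log(1/\delta)} + 2\kappa^{1/2}\|\theta_{\max}\|}{\sqrt{\kappa + C_v t}} & \text{if } t = t_i \text{ for some } i, \\[2ex] b_{t-1} & \text{otherwise,} \end{cases} \qquad b_0 = \|\theta_0 - \theta^*\|,$$

and $\{b_t\}$ is monotonically nonincreasing for all $t \ge 1$ with $\lim_{t\to\infty} b_t = 0$ a.s.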
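
The behavior of the bound sequence can be illustrated numerically. The sketch below assumes the bound at an update time $t$ has the form $\big(2C_\epsilon\sqrt{(M+1)\log(C_\psi^{*2}t/\kappa + M + 1) + 2\log(1/\delta)} + 2\kappa^{1/2}\|\theta_{\max}\|\big)/\sqrt{\kappa + C_v t}$ and is carried forward unchanged between updates. All parameter values, update times and function names are hypothetical.

```python
import math

def update_bound(t, m, kappa, c_v, c_eps, c_psi, delta, theta_max_norm):
    """Assumed form of the error bound at an update time t."""
    log_term = (m + 1) * math.log(c_psi ** 2 * t / kappa + m + 1)
    numerator = 2.0 * c_eps * math.sqrt(log_term + 2.0 * math.log(1.0 / delta)) \
        + 2.0 * math.sqrt(kappa) * theta_max_norm
    return numerator / math.sqrt(kappa + c_v * t)

def bound_path(update_times, horizon, b0, **params):
    """Carry the bound forward: refresh at update times, hold otherwise."""
    updates = set(update_times)
    b, path = b0, []
    for t in range(1, horizon + 1):
        if t in updates:
            b = update_bound(t, **params)
        path.append(b)
    return path

params = dict(m=6, kappa=1.0, c_v=1.0, c_eps=1.0, c_psi=1.0,
              delta=0.05, theta_max_norm=1.0)
at_updates = [update_bound(t, **params) for t in range(1, 201)]
path = bound_path([5, 20], horizon=30, b0=10.0, **params)
```

With these illustrative values the bound decreases at every candidate update time, so the held sequence is nonincreasing regardless of when updates happen.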

Moreover, we can show that CTRACE is efficient in the sense that its $(\epsilon, \delta)$-convergence time is bounded above by a polynomial of $1/\epsilon$, $\log(1/\delta)$ and $\log(1/\delta')$ with probability at least $1 - \delta'$. We define the $(\epsilon, \delta)$-convergence time to be the first time when an estimate and all the future estimates following it are within an $\epsilon$-neighborhood of $\theta^*$ with probability at least $1 - \delta$. If $\epsilon$ is sufficiently small, we can apply Lemma 4 and Lemma 5 to guarantee that $\lambda_{\min}(V_t)$ increases linearly in $t$ with high probability after the $(\epsilon, \delta)$-convergence time, and thereby confidence-triggered update occurs every $\tau$ periods. This is a critical property that will be used for deriving a poly-logarithmic finite-time expected regret bound for CTRACE. By Theorem 2, it is easy to see that the $(\epsilon, \delta)$-convergence time of CTRACE is bounded above by $t_{N(\epsilon,\delta,C_v)}$ where $N(\epsilon,\delta,C_v)$ is defined as

$$N(\epsilon,\delta,C_v) = \inf\Big\{ i \in \mathbb{N} : \frac{2C_\epsilon\sqrt{(M+1)\log\big(C_\psi^{*2}\,t_i/\kappa + M + 1\big) + 2\log(1/\delta)} + 2\kappa^{1/2}\|\theta_{\max}\|}{\sqrt{\kappa + C_v t_i}} \le \epsilon \Big\}.$$



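The following theorem presents the polynomial bound on the $(\epsilon, \delta)$-convergence time of CTRACE.

Theorem 3 (Efficiency of CTRACE). For any $\epsilon > 0$, $0 < \delta, \delta' < 1$, $\tau \ge N\log(2C_g/\xi)/\log(1/\xi)$ and $C_v < \tfrac{7}{8}\lambda^*_{\psi\psi}$, on the event $B(\delta')$ defined in Lemma 3,

$$t_{N(\epsilon,\delta,C_v)} \le T_1^*(\delta') \vee \tau + T_2(\epsilon, \delta, C_v) \quad \text{where}$$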



Tiie, S,C,) 



SC^C^iM + 1) + 4^4CfCl{M + 1)2 + KdCe^ ((M + 1)^/^ + 21og(l/5)) \ (4Ac||e™„,||) = 



Finally, we derive a finite-time expected regret bound for CTRACE that is quadratic in the logarithm of elapsed time, using the efficiency of CTRACE and Lemma 5.



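Theorem 4 (Finite-Time Expected Regret Bound of CTRACE). If $\pi$ is CTRACE with $M \ge 2$, $\tau \ge N\log(2C_g/\xi)/\log(1/\xi)$ and $C_v < \tfrac{7}{8}\lambda^*_{\psi\psi}$, then for any $\nu \in (\xi, 1)$ and all $T \ge 2$,

$$R_T^{\pi}(z_0) \le 2\|P^*\|C_z^{*2} + (R + B^\top P^* B)\,C_z^{*2}C_L^2\Bigg[\big(\tau_1^*(T) + \tau_2^*(T) + 1\big)\|\theta_{\max}\|^2 + \tau_3^*(T)\,\epsilon^2$$
$$\qquad + \frac{\tau\Big(2C_\epsilon\sqrt{(M+1)\log\big(C_\psi^{*2}\,T/\kappa + M + 1\big) + 2\log(2T)} + 2\kappa^{1/2}\|\theta_{\max}\|\Big)^2}{C} \times \log\frac{\kappa + C(T-1) - (C - C_v)^+\big(\tau_1^*(T)+\tau_2^*(T)\big)}{\kappa + C(\tau^*(T)-1) - (C - C_v)^+\big(\tau_1^*(T)+\tau_2^*(T)\big)}\Bigg]$$

where $C \triangleq \tfrac{7}{8}\lambda_{\min}\big(U(\theta^*)\Pi_{zz}(\theta^*)U(\theta^*)^\top\big)$, $\tau^*(T) = \tau_1^*(T) + \tau_2^*(T) + \tau_3^*(T)$,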



(T) . 8 ( 32(C.*C,)^(M + A- + 1)X ^ /(A/ + A- + 2)^rX ^ ^ /32(C;C,)^(M + + 1)^ 
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r;{T) 



'SC^C^M + 1) + 4y^4C4C|(M + l)^ + nC^C^e^ ((M + 1)3/2 + 21og(2r)) \ ' (4«;|10„J|)2 



.3*(T) = 8 ( 32(a-C )^(A/ + /.+ l)y /(M + A- + 2m ^ ^ / 32(C.*C 4- + 1) ^ ^ 

3C*(2a + C*) 



VmTTCl \ 42iVCf+iC2 A2NC^+^Cl{l + \\U{e*)\\Y NC^-^j' 

Note that $\tau_1^*(T)$, $\tau_2^*(T)$ and $\tau_3^*(T)$ are all $O(\log T)$. Therefore, it is not difficult to see that the expected regret bound for CTRACE is $O(\log^2 T)$.

4 Computational Analysis 

In this section, we compare via Monte Carlo simulation the performance of CTRACE to that of two benchmark policies: CE and a reinforcement learning algorithm recently proposed in Abbasi-Yadkori and Szepesvári [2010], which is referred to as the AS policy from now on. The AS policy was designed to explore efficiently in a broader class of linear-quadratic control problems and appears well-suited for our problem. It updates an estimate only when the determinant of $V_t$ is at least twice as large as the determinant evaluated at the last update, and selects an element from a high-probability confidence region that yields maximum average reward. In our problem, the AS policy translates to updating an estimate with $\theta_t = \arg\min_{\theta \in S_t(\delta) \cap \Theta} \mathrm{tr}(P(\theta)\Omega)$ at each update time $t$. Intuitively, the smaller the price impact, the larger the average profit or, equivalently, the smaller $\mathrm{tr}(P(\theta)\Omega)$, which is the negative of average profit. In light of this, we restrict our attention to solutions to $\min_{\theta \in S_t(\delta) \cap \Theta} \mathrm{tr}(P(\theta)\Omega)$ of the form $\{\alpha_t\theta_{con,t} \in S_t(\delta) \cap \Theta : 0 \le \alpha_t \le 1\}$ where $\theta_{con,t}$ denotes a least-squares estimate constrained to $\Theta$ with $\ell_2$ regularization. The motivation is to reduce the amount of computation needed for the AS policy, which would otherwise be prohibitive. Indeed, the minimum appears to be attained always with the smallest $\alpha_t$ such that $\alpha_t\theta_{con,t} \in S_t(\delta) \cap \Theta$, which is provable in the special case considered in Subsection 2.3. Note that $\alpha_t$ can be viewed as a measure of aggressiveness of exploration: $\alpha_t = 1$ means no extra exploration and smaller $\alpha_t$ implies more active exploration.
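
The two ingredients of this restricted AS policy, the determinant-doubling trigger and the search for the smallest scaling factor, can be sketched as follows. The sketch assumes the confidence region is an ellipsoid $\{\theta : (\theta - \hat\theta)^\top V(\theta - \hat\theta) \le \beta^2\}$ and, for simplicity, ignores the intersection with $\Theta$; the function names and the radius argument `beta` are hypothetical.

```python
import numpy as np

def determinant_doubled(V, V_last):
    """AS-style trigger: det(V) is at least twice det(V_last)."""
    return np.linalg.slogdet(V)[1] >= np.linalg.slogdet(V_last)[1] + np.log(2.0)

def smallest_alpha(theta_con, theta_hat, V, beta):
    """Smallest alpha in [0, 1] with alpha*theta_con inside the ellipsoid.

    Solves a*alpha^2 - 2*b*alpha + c <= 0 and returns None if the segment
    {alpha*theta_con : 0 <= alpha <= 1} misses the ellipsoid.
    """
    a = theta_con @ V @ theta_con
    b = theta_con @ V @ theta_hat
    c = theta_hat @ V @ theta_hat - beta ** 2
    disc = b * b - a * c
    if a <= 0.0 or disc < 0.0:
        return None
    root = np.sqrt(disc)
    lo, hi = (b - root) / a, (b + root) / a
    lo, hi = max(lo, 0.0), min(hi, 1.0)
    return lo if lo <= hi else None

V = np.eye(2)
theta_hat = np.array([1.0, 0.0])
alpha_tight = smallest_alpha(theta_hat, theta_hat, V, beta=0.5)
alpha_loose = smallest_alpha(theta_hat, theta_hat, V, beta=2.0)
```

A tight confidence region keeps the scaled estimate close to the least-squares estimate, while a loose one allows the scaling factor to drop to zero, which corresponds to the most aggressive exploration.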
Table 1: Monte Carlo Simulation Setting (1 trading day = 6.5 hours)

    Parameter                 Value
    ------------------------  ------------------------------------------
    M                         6
    K                         2
    Trading interval          5 mins
    Initial asset price       $50
    Half-life of r            [5, 7.5, 10, 15, 30, 45] mins
    Half-life of factor       [10, 40] mins
    r                         [0.50, 0.63, 0.71, 0.79, 0.89, 0.93]
    Φ                         diag([0.707, 0.917])
    γ ($/share)               [0, 6, 0, 3, 7, 5] x 10-«
    λ ($/share)               2 x 10"«
    (illegible)               0.0013 (annualized vol. = 10%)
    (illegible)               diag([1, 1])
    ρ                         1 x 10"^
    θ_max                     (5 X lO^Ol
    (illegible)               5 X lO"'^
    g                         [0.006, 0.002]
    T                         3000 (approx. 38 trading days)
    Sample paths              600



Table 1 summarizes numerical values used in our simulation. The signal-to-noise ratio (SNR), which is defined as $E\big[\big(\lambda u_t + \sum_{m=1}^{M}\gamma_m(d_{m,t} - d_{m,t-1})\big)^2\big]/E[\epsilon_t^2]$ under $u_t = L(\theta^*)z_{t-1}$, is 0.058 and the optimal average profit is $765.19 per period. $\epsilon_t$ and $\omega_t$ are sampled independently from Gaussian distributions even though $\omega_t$ is assumed to be bounded almost surely for the theoretical analysis. In fact, it turns out that the use of a Gaussian distribution for $\omega_t$ does not make a noticeable difference from the bounded case. The regularization coefficient $\kappa$, the confidence-triggered update threshold $C_v$, the minimum inter-update time $\tau$ and the significance level $\delta$ are chosen via cross-validation with realized profit: for CTRACE, $\kappa = 1 \times 10^{11}$, $C_v = 600$ and $\tau = 1$. For the AS policy, $\kappa$ = 1 x 10^ and $\delta = 0.99$. The reason for the smaller $\kappa$ and the large $\delta$ for the AS policy is to keep the radius of confidence regions small, because the exploration done by the AS policy tends to be more than necessary and thus costly.
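
As a consistency check on Table 1, the decay rates appear to follow from the listed half-lives through $r = 2^{-\Delta/h}$, where $\Delta$ is the 5-minute trading interval and $h$ the half-life. This relation is an inference from the numbers rather than something stated in the text, so the sketch below should be read as a plausibility check only; the function name is hypothetical.

```python
def decay_from_half_life(half_life_min, interval_min=5.0):
    """Per-period decay rate implied by a half-life, assuming r = 2^(-interval/half_life)."""
    return 0.5 ** (interval_min / half_life_min)

impact_decay = [round(decay_from_half_life(h), 2) for h in [5, 7.5, 10, 15, 30, 45]]
factor_decay = [round(decay_from_half_life(h), 3) for h in [10, 40]]

periods_per_day = 6.5 * 60 / 5          # 5-minute periods in a 6.5-hour trading day
trading_days = 3000 / periods_per_day   # horizon T = 3000 periods

print(impact_decay, factor_decay, trading_days)
```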

The left panel of Figure 2 illustrates the improvement in relative regret due to regularization. It shows the relative regret of CTRACE with varying $\kappa$ and fixed $C_v = 0$, i.e., no confidence-triggered update. The vertical bars indicate two standard errors in both directions, that is, approximate 95% confidence intervals. It is clear that the relative regret is reduced as CTRACE regularizes more, and the improvement from no regularization to $\kappa = 1 \times 10^{11}$ is statistically significant at the approximate 95% confidence level. The right panel of Figure 2 shows the improvement achieved by confidence-triggered update with varying $C_v$ but fixed $\kappa = 1 \times 10^{11}$. As can be seen, updating based on confidence makes a substantial contribution to reducing relative regret further. The improvement from $C_v = 0$ to $C_v = 600$ is statistically significant at the approximate 95% confidence level.
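
The "two standard errors" intervals used in the figures can be computed as in the sketch below. The sample of paired profit differences is synthetic and hypothetical; only the interval construction (sample mean plus or minus two standard errors over the sample paths) reflects the procedure described in the text.

```python
import numpy as np

def two_standard_error_interval(samples):
    """Return (mean, lower, upper) with bounds at mean -/+ 2 standard errors."""
    x = np.asarray(samples, dtype=float)
    mean = x.mean()
    se = x.std(ddof=1) / np.sqrt(x.size)   # standard error of the mean
    return mean, mean - 2.0 * se, mean + 2.0 * se

rng = np.random.default_rng(0)
# Hypothetical paired differences between two policies over 600 sample paths.
profit_diff = rng.normal(loc=1.0, scale=2.0, size=600)
mean, lower, upper = two_standard_error_interval(profit_diff)
# The difference is called significant when the interval excludes zero.
significant = lower > 0.0 or upper < 0.0
```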

As shown on the left of Figure 3, CTRACE clearly outperforms CE in terms of relative regret and the difference is statistically significant at the approximate 95% confidence level. The dominance
[Figure 2 panels: "Relative Regret with Varying κ (C_v = 0)" and "Relative Regret with Varying C_v (κ = 1e+11)"; horizontal axis: Period.]

Figure 2: Relative regret with varying $\kappa$ and $C_v$: (Left) Varying $\kappa \in \{0,$ 2 x 10^"$, 1 \times 10^{11}\}$ with fixed $C_v = 0$. (Right) Varying $C_v \in \{0, 20, 600\}$ with fixed $\kappa = 1 \times 10^{11}$.



stems from both regularization and confidence-triggered update as shown in Figure 2. The figure on the right shows an empirical distribution of the difference between the realized profit of CTRACE and that of CE over 600 sample paths. Many more realizations are located to the right of zero. This implies that CTRACE earns more profit than CE in most sample paths.

Finally, we compare the performance of CTRACE to that of the AS policy in Figure 4. The left figure shows that CTRACE outperforms the AS policy even more drastically than it outperforms CE in terms of relative regret, and the superiority is statistically significant at the approximate 95% confidence level. On the right, you can see an empirical distribution of the difference between the realized profit of CTRACE



[Figure 3 panels: "Relative Regret" (horizontal axis: Period) and "Distribution of Realized Profit Difference (CTRACE - CE)" (horizontal axis: Dollar).]

Figure 3: (Left) Relative regret of CTRACE and CE. (Right) Distribution of realized profit of CTRACE and CE. The red dotted line represents zero difference.
[Figure 4 panels: "Relative Regret" (horizontal axis: Period) and "Distribution of Realized Profit Difference (CTRACE - AS)" (horizontal axis: Dollar).]

Figure 4: (Left) Relative regret of CTRACE and AS policy. (Right) Distribution of realized profit of CTRACE and AS policy. The red dotted line represents zero difference.

and that of the AS policy over 600 sample paths. It is clear that CTRACE is more profitable than the AS policy in most of the sample paths. This illustrates that the aggressive exploration performed by the AS policy is too costly. The reason is that the AS policy is designed to explore actively in situations where pure exploitation done by CE is unable to identify a true model. In our problem, however, a great degree of exploration is naturally induced by observable return-predictive factors, and thus the aggressiveness of exploration suggested by the AS policy turns out to be more than necessary. Meanwhile, CTRACE strikes a desired balance between exploration and exploitation by taking into account factor-driven natural exploration.

5 Conclusion 

We have considered a dynamic trading problem where a trader maximizes expected average risk-adjusted profit while trading a single security in the presence of unknown price impact. Our problem can be viewed as a special case of reinforcement learning: the trader can improve longer-term performance significantly by making decisions that explore efficiently to learn price impact at the expense of suboptimal short-term behavior, such as executing larger orders than appear optimal with respect to current information. As in other reinforcement learning problems, it is crucial to strike a balance between exploration and exploitation. To this end, we have proposed the confidence-triggered regularized adaptive certainty equivalent policy (CTRACE), which improves on purely exploitative certainty equivalent control (CE) in our problem. The enhancement is attributed to two properties of CTRACE: regularization and confidence-triggered update. Regularization encourages active exploration that accelerates learning and also reduces the variance of the estimator. It helps keep CTRACE from being a passive learner due to overestimation of price impact that abates trading. Confidence-triggered update allows CTRACE to have monotonically nonincreasing upper bounds on estimation errors so that it reduces the frequency of overestimation. Using these two properties, we derived a finite-time expected regret bound for CTRACE of the form $O(\log^2 T)$. Finally, we have demonstrated through Monte Carlo simulation that CTRACE outperforms CE and a reinforcement learning policy recently proposed in Abbasi-Yadkori and Szepesvári [2010].

As an extension to our current model, it would be interesting to develop an efficient reinforcement learning algorithm for a portfolio of securities. Another interesting direction is to incorporate prior knowledge of particular structures of price impact coefficients, e.g., sparsity, into the estimation problem. It is also worth considering other regularization schemes such as LASSO.



A Proofs 

Proof of Theorem 1. Since the evolution of $f_t$ is not affected by $\{x_t\}$, $\{d_t\}$ and $\{u_t\}$, it is not difficult to see that there exists a desired $P$ for our stochastic control problem if there exists $P$ with the same properties for a deterministic control problem having no $f_t$ and $g = 0$. Let $(A, B, Q, R, S)$ denote reduced coefficient matrices of appropriate dimensions for the deterministic problem. Now, $(A, B)$ is controllable and this problem is a special case of the problem considered in Molinari [1975]. By Theorem 1 in Molinari [1975], there exists a desired $P$ if $\Psi(z) > 0$ for all $z$ on the unit circle where

$$\Psi(z) = \begin{bmatrix} B^\top(Iz^{-1} - A^\top)^{-1} & I \end{bmatrix} \begin{bmatrix} Q & S \\ S^\top & R \end{bmatrix} \begin{bmatrix} (Iz - A)^{-1}B \\ I \end{bmatrix}.$$

In our problem, it is not difficult to check that for any $\phi \in (0, 2\pi)$, $\lambda > 0$ and $\gamma_m \ge 0$,



A 



M 



2(1 — cos I 



^ 2 + ^ 1 + r2 

m=l " 



27^(1 - rm COS( 



2rm COS ( 



> 
and $\lim_{\phi \to 0}\Psi(e^{i\phi}) = \infty > 0$. Therefore, the desired result follows. Noting an upper block diagonal structure of the original closed-loop system matrix $A + BL$, we can easily see that the stability for the deterministic problem carries over to our original problem. The uniqueness of a stabilizing solution follows from the stability. For the optimality of $\pi$, we can use the same proof as in Chapter 4 of Bertsekas [2005].



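Proof of Lemma 1. By Theorem 1, $\rho_{sr}(G(\theta)) < 1$ for all $\theta \in \Theta$. Since $\Theta$ is a compact set and Assumption 2-(a) implies the continuity of $L(\theta)$ and $G(\theta)$, it follows that $\sup_{\theta \in \Theta}\|G(\theta)\| < \infty$ and $\sup_{\theta \in \Theta}\rho_{sr}(G(\theta)) < 1$. Therefore, by the theorem in Buchanan and Parlett [1966], $\{G^n(\theta)\}$ uniformly converges to the zero matrix. That is, for any $0 < \xi < 1$, there exists $N \in \mathbb{N}$ independent of $\theta$ such that $\|G^N(\theta)\| \le \xi$ for all $\theta \in \Theta$. Also, $\max_{0 \le i \le N-1}\sup_{\theta \in \Theta}\|G^i(\theta)\| < \infty$ by continuity of $G(\theta)$ and compactness of $\Theta$. For any $t \ge 0$, it is easy to see that $\|G^t(\theta)\| \le C_g\,\xi^{\lfloor t/N \rfloor}$ by definition of $C_g$ and $N$. Since $z_t = G^t(\theta)z_0 + \sum_{i=1}^{t}G^{t-i}(\theta)W_i$,

$$\|z_t\| \le \|G^t(\theta)\|\|z_0\| + \sum_{i=1}^{t}\|G^{t-i}(\theta)\|\|W_i\| \le C_g\,\xi^{\lfloor t/N \rfloor}\|z_0\| + \sum_{i=1}^{t}C_g C_\omega\,\xi^{\lfloor (t-i)/N \rfloor}$$
$$\le C_g\|z_0\| + C_g C_\omega\sum_{i=1}^{t}\xi^{(t-i)/N - 1} \le C_g\|z_0\| + \frac{C_g C_\omega}{\xi(1 - \xi^{1/N})} \quad \text{a.s.}$$

Since $U(\theta) = (G(\theta))_{1:M+1,*} - [I \;\; 0]$, it follows that $\|U(\theta)\| \le \|(G(\theta))_{1:M+1,*}\| + \|[I \;\; 0]\| \le C_g + 1$. ■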

Proof of Lemma [2] For notational simplicity, let G = G(9), L = L(9) and Ilzz = ^zz{9)- The 



almost-sure convergence in (jl]) follows from Lemma 2 in 



Anderson and Taylor 



19791 1 . It is easy to 



see that U{9) is full-rank since ^ 0. Therefore, it is sufficient to show that Ilzz is positive 
definite. Since G is a stable matrix and n^O, U^z = EZo G^QiG'^f >z E»=o^ = HH^ 



where H 



Thus, it is sufficient to show that H is full-rank. First, 



we will show that {{G)i-m+i,m+2, ■ ■ ■ , {G^^^^)i;M+i,m+2} is linearly independent. We can show by 



induction that (G*)=^,A/+2 = 5'i(n) ••• giirAi) hi] where ^^(r) = (L)a/+2 Em=o(^™)i,i 



„i— 1— m 



and hi = (<I>*)=i,^i. Since each gi{r) is a polynomial of degree i — 1 and its leading coefficient 
is all {L)m+2 7^ 0, we can transform [{G)i:M+i,m+2, ■■■ (G^^+^)i:M+i,a/+2] into Vandermonde 
matrix through elementary row operations. Thus, [{G)i-m+i,m+2 ■■■ (G^^+"^)i:M+i,Af+2] is non- 
singluar. Now, suppose H = for some a G ]R*^+^+^. By definition of H and Q, it implies 
(a)Af+2:A/+A'+i = 0. Then, by nonsingularity of [(G)i:A/+i,a/+2 ••• (G^^+^)i:M+i,Af+2], we may 
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conclude aj.j^jj^-^ = 0. Therefore, a = and we may conclude that H is full-rank. ■ 
Proof of Corollary [1] By Assumption [2]- (a), L{9) is continuous on 6 and so are G{9) and U{9). 
Uniform convergence of E A J2t=i zt~izj_i to Tlzz{G) on follows from the fact that for any e > 



t=i 
IkolP 



t=l 



t=T 



2(T-1) 

2 1 — < e for sufficiently large T independent of 9. 



Since E i ^f^^ zt_izj_^ = ^zqzJ + ^ Zjji E*=o G'n{G^Y is continuous in ^ G 6 for all T > 1, 
the limiting matrix 1122(0) is continuous in G component- wise. Thus, so is U{9)Ilzz{9)U{9)'^ . 
Finally, A^in (u{9)n,z{9)U{9y^ is continuous on 0. Since A^in (y{9)Uz,{9)Ui9y^ > 0, V6' G 
and is a compact set, it follows from its continuity that infggQ Amin {ui9)n,,{9)u{9y) > 0. ■ 
Proof of Lemma [3] Let e, G '\^'^+^+^ denote an elementary vector whose entries are all zero 
except for ith entry being one and rjij^k — ZkzJ ej — ej E[zkzJ \J^k-i]ej , l<i,j,<M + K + l. 
Since |r/ij,fc| < 2(7^ a.s., {rjij^k} is an almost-surely bounded martingale difference process adapted 
to {J^k} and thus it is conditionally sub-Gaussia n with E\ex.^p('yr]ii,k)\J^_ k -^] < exp (7^(2(72)^/2) a.s. 



Hence, if we use a special case of Corollary 1 in 
then for all 1 < i, j < M + X + 1 and any a > 



Abbasi-Yadkori et al 



201 11 ] with nik = 1 for all k, 



Pr 



E '^ii.* 



k=l 



<2C2W(a + t)log 



nu + K + 2)^{a + t) 



Vt > 1 > 1 
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{M + K + 2Y' 



Using r/jjfc = 77^ j and E[2;fe2;J|J^fc_i] = G{9)zk-izl_iG{9)^ + 0, it follows from the union bound 
that Pr {Yt)^. I < e, 1 < i, j < M + K + 1, Vt > t*(J, e, a) j >l-5 where Yt ^ \ ELi ZkzJ - 
G{9) (i ELi G(0)T - and 



On the above event, < \\Yt\\F < {M + K + l)eimd -{M + K+l)eI ^Yt < {M + K+l)eI, Vt > 
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t*{6, e, a). We can rewrite -(M + K + l)e/ ^ Yt as 



1 ^ y G{e) ^ ^fc-i^fcLi^ G{ey + n-{M + K + l)el + Qzo4 - ]^tzj^ ■ 

Repeating n times a process of left-multiplying both sides with G{6), right-multiplying with G{9)~^ 
and adding the resulting inequality into the original one side-by-side, we obtain 

- y: t G^^\e) - y: zk-izj_, G-^\ey + y: G\e)^G\e)' 

^ k=l V k=l ) i=0 

n " /I 1 \ 

- (M + + 1)6 ^ G\e)G\ey + Y T^o^cJ" - 7^i^7 ^^(0)^. 



i=0 i=0 



Note that 



and 



TJi=oG\e)G\e)^ < ELo WGK^W < Cg/(C^(1 - i'^'^))- Taking limit over n and using 



these two inequalities, we have with probability at least 1 — 5 

1 * (GliM + K + l) 1 2GlGl \ 

Setting e = ^^(1 -e2/^)Amin(n,,(^))/(16C72(M + i^ + 1)) and a = 216, we have f ELi ^ 
U,,{e) - Wn^^j ^ > r(5,e,a) V32C72C72/(^2(;L _^2/7V)^^.^(n^^(^)))^ jg ^^^^ g^^^^ 

that e, a) > 32C2c2/(^2(;l _ ^2/Af)Aj„i„(n^^(^))). Similarly, from Ft ^ (M + K + l)e/, we can 
obtain for all t > t* {6, e, a) 



* fc=i V 
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Since Amm(n,,(^))/ r< n,,(0), it follows that |n,,(^) r< ^ ELi -^fc-i^-i =< Tin,,(^) and thus 

lu{e)u,MU{eV ^ eLi zk.^zJ_,u{oV ^ i|f/(0)n,,(0)c/(0)T. ■ 

Proof of Lemma |4] For notational convenience, let G = G{9), Gt = G{9t), U = U{9), Ut = 
U{9t), n,, = U,,{9) and n(i, j) = • • • G,. By definition of G and r?, - G|| < \\B\\\\L{9t) - 



L{9)\\ < y/MTTCLpt - 9\\ < rj, yt > 0. Since zt can be expressed as zt = n(0,t - l)zo + 
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X;*=i n(«, t - l)Wi, we have 



T 1 T / t-1 \ / t-1 ^ ~^ 



t=i t=i \ i=i / \ j=i 

+ - E (n(0, t - 2)zo4n(0, t - 2)T - G'-^zozjG'-^^) • • • (a) 

+ ^ E E (n(0, t - 2)zoWjU{j, t - 2)T - G*-i2oW^^G'*--'-iT) ... (5) 

t=2j=l 

+ |EE(n(i>i-2)^^.4n(o,t-2)^-G*-^'-iiy,4G*-iT) •••(c) 



t=2j=l 



+ ^ E E E (n(i, * - 2) WiW^^n(j, t - 2)^ - G'~'~'w,wjG'-^~'^) •••(d) 



T 

t=2 i=l j=l 

Then, we can show that 



It follows that 

^ E ^ ^ E (g'^'^o + E f G*-i.o + E G'^'^'W, 1 - ^EEilnfiZj. 

t=l t=l \ i=l / \ j=l 

Taking liminf on both sides, liminfT-^00 ^ E?=i ^t-i^t-i ^ - ^"""("^^) / y Vin(n..) j ^^^^ like- 
wise, we can show that hminfT-^00 7 Ef=i V-t^^ ^ UYi^^U^ - ^iinhiiMidLD. [ ^ A^in(t/n,,c/^) ^ 



Proof of Lemma [5] Using the same techniques in the proof of Lemma [3] and Lemma HI we can 
obtain that on the event B{5) with Fr{B{d)) > I — 5, 

^jz^ti^J ^ ui9)^j2iGi0Y-'zo + Y.G{eY~^-'w,) (Gi9Y-'zo + J2GioY~'~'w,] uio)-^ 

t=l t=l \ i=l / \ j=l J 
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2 

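Proof of Lemma 6. Using $\|z_{t_i+j}\| \le C_g\,\xi^{j/N-1}\|z_{t_i}\| + C_g C_\omega/(\xi(1 - \xi^{1/N}))$ a.s. for $j \le t_{i+1} - t_i$ and $C_g\,\xi^{\tau/N-1} \le \tfrac{1}{2}$, we can show by induction that $\|z_{t_i}\| \le 2C_g C_\omega/(\xi(1 - \xi^{1/N}))$ a.s. for all $i \ge 1$. For any $t_i \le t \le t_{i+1}$, $\|z_t\| \le C_g\|z_{t_i}\| + C_g C_\omega/(\xi(1 - \xi^{1/N})) \le C_z^*$ a.s. Finally, $\|\psi_t\| \le \|U(\theta_{t-1})\|\|z_{t-1}\| \le (C_g + 1)C_z^* = C_\psi^*$. ■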

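Proof of Theorem 2. Given $C_v < \lambda^*_{\psi\psi} \le \lambda_{\min}\big(U(\theta)\Pi_{zz}(\theta)U(\theta)^\top\big)$, it is easy to show that $\Pr(t_i < \infty, \ \forall i \ge 1) = 1$. Using $\theta_{t_i} = \arg\min_{\theta \in \Theta}\sum_{j=1}^{t_i}\big((\Delta p_j - g^\top f_{j-1}) - \psi_j^\top\theta\big)^2 + \kappa\|\theta\|^2 = \arg\min_{\theta \in \Theta}(\theta - \hat\theta_{t_i})^\top V_{t_i}(\theta - \hat\theta_{t_i})$ and Proposition 1, we can show that on the event $\{\theta^* \in S_t(\delta) \ \forall t \ge 1\}$, for any $i \ge 1$,

$$\|\theta_{t_i} - \theta^*\| \le \|\theta_{t_i} - \hat\theta_{t_i}\| + \|\hat\theta_{t_i} - \theta^*\| \le \frac{2C_\epsilon\sqrt{(M+1)\log\big(C_\psi^{*2}\,t_i/\kappa + M + 1\big) + 2\log(1/\delta)} + 2\kappa^{1/2}\|\theta_{\max}\|}{\sqrt{\kappa + C_v t_i}} = b_{t_i}.$$

For any $t_i \le t < t_{i+1}$, $\|\theta_t - \theta^*\| = \|\theta_{t_i} - \theta^*\| \le b_{t_i} = b_t$. It is easy to show through elementary calculus that $b_{t_i}$ is strictly decreasing in $t_i \ge 1$ if $M \ge 2$. ■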
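Proof of Theorem 3. Using $\log(t + M + 1) \le \sqrt{t} + \sqrt{M + 1}$ for all $t \ge 0$, we can show that

$$\frac{2C_\epsilon\sqrt{(M+1)\log\big(C_\psi^{*2}\,t/\kappa + M + 1\big) + 2\log(1/\delta)} + 2\kappa^{1/2}\|\theta_{\max}\|}{\sqrt{\kappa + C_v t}} \le \epsilon, \quad \forall t \ge T_2(\epsilon,\delta,C_v).$$

Suppose for contradiction that $t_{N(\epsilon,\delta,C_v)} > T_1^*(\delta') \vee \tau + T_2(\epsilon,\delta,C_v) = T^*$. Let $t_i$ be the last update time less than $T_2(\epsilon,\delta,C_v)$. Then, there is no update time in the interval $[t_i + 1, T^*]$ by definition of $t_{N(\epsilon,\delta,C_v)}$ and $T_2(\epsilon,\delta,C_v)$. By definition of $t_i$ and Lemma 3,

$$\lambda_{\min}(V_{T^*}) \ge \lambda_{\min}(V_{t_i}) + \lambda_{\min}\Big(\sum_{t=t_i+1}^{T^*}\psi_t\psi_t^\top\Big) \ge \kappa + C_v t_i + \tfrac{7}{8}\lambda^*_{\psi\psi}(T^* - t_i) \ge \kappa + C_v T^*.$$

It is clear that $T^* - t_i \ge \tau$. Consequently, $T^*$ is eligible for a next update time after $t_i$. It implies that $t_{N(\epsilon,\delta,C_v)} = T^*$, but this is a contradiction. ■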
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Proof of Theorem [4] Note that 



{R + B^P*B) J2mOt-i) - L{e*))zt-if <{R + B^P*B)CfCl ^ \\9t-i - 9 



*l|2 



t=l 



t=l 



Set 6 = 1/T. Then, on the event A{T) = {9* G 5t(l/(2T)) Vt > 1} n B{1/{2T)) with Pr {A{T)) > 
1 — 1/T, we have 



r*(T)+r*(T) r-(T) 

t=r*(T)+r2*(T)+l 



t=l 



By Lemma O 



,i/(2T),C„) J + '^min 



> K + C(t - 1) - (C - a)+(rr(r) + T^iT)), \ft > T*{T). 



yi=tjv{£,i/(2T),C„)+l 



Therefore, 



E ii^-i 

t=r*(T) + l 



< 



E 



t=r*(T) + l 



< 



T (^2C,y (M + 1) log (Cl T/k + a/ + l) + 21og (2T) + 2k1/2||0^^^^^||^ 



X log 



C 

K + C{T~l)~{C~ C,)+{Tt {T) + r| (r)) 

K + (7(r*(r) - 1) - (c - a)+K(r) + r*(r)) 



2^i/2|l, 



Let g = Pr {A{T)), Lt = L{9t) and L* = L{9*). Then, 



r T 



R^{zq) = ^[z^P*z*T - z^P*zt] + E 



J2iLt-iZt-i - L*zt-iy{R + B'^P*B){Lt-iZt^i - L* zt-i] 



U=i 



<2||P*||C,^ + gE 

T 



J2{Lt~iZt^i-L*zt^i)^{R + B^P*B){Lt-iZt^i-L*zt^i: 



Lt=l 



A{T) 



+ (1 



J2{Lt-iZt-i - L*zt^i)^{R + B^P*B){Lt^iZt-i - L*zt-i) 



t=i 



< 2||P*||C2 + E 



J2iLt-iZt-i - Vzt^iYiR + B^P*B){Lt^iZt-i - L* zt-i 



AiT) 
AiT) 



t=i 
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+ {R + B~^P*B)C^Cl\\9, 



max 



2 




) 



< 2||P*||C2 + {R + B'^P*B)C^^Cl[ {t*{T)+t*{T) + 1) \\9n,^J^ + r^{T)e^ 



+ 



T [2c J {M + 1) log (Cl TjK + M + 1) + 21og (2r) + 2kV2||0, 




C 
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